[libvirt] [REPOST 0/4] Adjustment to recent cgroup/cpuset changes (for 1.3.1)

Henning Schild henning.schild at siemens.com
Thu Jan 14 12:12:41 UTC 2016


On Thu, 14 Jan 2016 11:57:44 +0000
"Daniel P. Berrange" <berrange at redhat.com> wrote:

> On Wed, Jan 13, 2016 at 07:29:46AM -0500, John Ferlan wrote:
> > Reposting my cgroup fixes series:
> > 
> > http://www.redhat.com/archives/libvir-list/2016-January/msg00236.html
> > 
> > partially because I originally forgot to CC the author (Henning
> > Schild) of the original series for which these patch fix a couple
> > of issues discovered during regression testing (virt-test memtune
> > failures in Red Hat regression environment), but also to bring them
> > up to date with the top of libvirt git.
> > 
> > NB: I did send Henning the changes after the fact, but my resend
> > using the same message-id skills so that replies are left in the
> > onlist series are lacking.  Henning has looked at the first patch -
> > with a response here:
> > 
> > http://www.redhat.com/archives/libvir-list/2016-January/msg00443.html
> > 
> > Finally, I think these changes should go into 1.3.1 since that's
> > when the regression was introduced.  
> 
> Since this has been puzzelling us for a while, let me recap on the
> cgroup setup in general.
> 
> First, I'll describe how it used to work *before* Henning's patches
> were merged, on a systemd based host.
> 
>  - The QEMU driver forks a child process, but does *not* exec QEMU yet
>    The cgroup placement at this point is inherited from libvirtd. It
>    may look like this:
> 
>      10:freezer:/
>      9:cpuset:/
>      8:perf_event:/
>      7:hugetlb:/
>      6:blkio:/system.slice
>      5:memory:/system.slice
>      4:net_cls,net_prio:/
>      3:devices:/system.slice/libvirtd.service
>      2:cpu,cpuacct:/system.slice
>      1:name=systemd:/system.slice/libvirtd.service
> 
>  - The QEMU driver calls virCgroupNewMachine()
> 
>       - We calll virSystemdCreateMachine with pidleader=$child
> 
>            - Systemd creates the initial machine scope unit under
> 	     the machine slice unit, for the "systemd" controller.
> 	     It may also add the PID to *zero* or more other
> 	     resource controllers. So at this point the cgroup
> 	     placement may look like this:
> 
>               10:freezer:/
>               9:cpuset:/
>               8:perf_event:/
>               7:hugetlb:/
>               6:blkio:/
>               5:memory:/
>               4:net_cls,net_prio:/
>               3:devices:/
>               2:cpu,cpuacct:/
>               1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> 
>              Or may look like this:
> 
>               10:freezer:/machine.slice/machine-qemu\x2dserial.scope
>               9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
>               8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
>               7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
>               6:blkio:/machine.slice/machine-qemu\x2dserial.scope
>               5:memory:/machine.slice/machine-qemu\x2dserial.scope
>               4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
>               3:devices:/machine.slice/machine-qemu\x2dserial.scope
>               2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
>               1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> 
>              Or anywhere in between. We have *ZERO* guarantee about
> what other resource controllers we may have been placed in by
> 	     systemd. There is some fairly complex logic that
> determines this, based on what other tasks current exist in sibling
> 	     cgroups, and what tasks have *previously* existed in the
> 	     cgroups. IOW, you should consider the list of etra
> resource controllers essentially non-deterministic
> 
>       - We call virCgroupAddTask with pid=$child
> 
>         This places the pid in any resource controllers we need, which
> 	systemd has not already setup. IOW, it guarantees that we now
> 	have placement that should look like this, regardless of what
> 	systemd has done:
> 
>               10:freezer:/machine.slice/machine-qemu\x2dserial.scope
>               9:cpuset:/machine.slice/machine-qemu\x2dserial.scope
>               8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
>               7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
>               6:blkio:/machine.slice/machine-qemu\x2dserial.scope
>               5:memory:/machine.slice/machine-qemu\x2dserial.scope
>               4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
>               3:devices:/machine.slice/machine-qemu\x2dserial.scope
>               2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope
>               1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> 
>  - The QEMU driver now lets the child process exec QEMU. QEMU creates
>    its vCPU threads at this point. All QEMU threads (emulator, vcpu
>    and I/O threads) now have the cgroup placement shown above.
> 
>  - We create the emulator cgroup for the cpuset, cpu, cpuacct
> controllers move all threads into this new cgroup. All threads
> (emulator, vcpu and I/O threads) thus now have placement of:
> 
>            10:freezer:/machine.slice/machine-qemu\x2dserial.scope
>            9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/emulator
>            8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
>            7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
>            6:blkio:/machine.slice/machine-qemu\x2dserial.scope
>            5:memory:/machine.slice/machine-qemu\x2dserial.scope
>            4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
>            3:devices:/machine.slice/machine-qemu\x2dserial.scope
>            2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/emulator
>            1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> 
>    Yes, we really did move the vcpu threads into the emulator group...
> 
>  - We now ask QEMU which are the vCPU & I/O threads.
> 
>     - Foreach CPU thread we new vCPU cgroups and move them into this
>       place
> 
>            10:freezer:/machine.slice/machine-qemu\x2dserial.scope
>            9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/vcpuN
>            8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
>            7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
>            6:blkio:/machine.slice/machine-qemu\x2dserial.scope
>            5:memory:/machine.slice/machine-qemu\x2dserial.scope
>            4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
>            3:devices:/machine.slice/machine-qemu\x2dserial.scope
>            2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/vpuN
>            1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> 
>     - Foreach I/O thread we new vCPU cgroups and move them into this
>       place
> 
>            10:freezer:/machine.slice/machine-qemu\x2dserial.scope
>            9:cpuset:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
>            8:perf_event:/machine.slice/machine-qemu\x2dserial.scope
>            7:hugetlb:/machine.slice/machine-qemu\x2dserial.scope
>            6:blkio:/machine.slice/machine-qemu\x2dserial.scope
>            5:memory:/machine.slice/machine-qemu\x2dserial.scope
>            4:net_cls,net_prio:/machine.slice/machine-qemu\x2dserial.scope
>            3:devices:/machine.slice/machine-qemu\x2dserial.scope
>            2:cpu,cpuacct:/machine.slice/machine-qemu\x2dserial.scope/iothreadN
>            1:name=systemd:/machine.slice/machine-qemu\x2dserial.scope
> 
> 
> 
> Now, lets Henning's three patches
> 
> 
>    commit 71ce475967b3523882b0e1a0f649fdbea5abf9d5
>    Author: Henning Schild <henning.schild at siemens.com>
>    Date:   Wed Dec 9 17:58:10 2015 +0100
> 
>     util: cgroups do not implicitly add task to new machine cgroup
> 
> This alters virCgroupNewMachine so that it no longer calls
> virCgroupAddTask. instead the callers are responsible for doing it.
> This is a functional no-opp change, so ostensibly harmless at this
> point.
> 
> The second patch
> 
>   commit a41c00b472efaa192d2deae51ab732e65903238f
>   Author: Henning Schild <henning.schild at siemens.com>
>   Date:   Mon Dec 14 15:48:05 2015 -0500
> 
>     qemu: do not put a task into machine cgroup
> 
> This stops qemuInitCgroup from calling virCgroupAddTask. Instead it
> has added a call to vriCgroupAddTask in qemuSetupCgroupForEmulator(),
> which only affects the cpu, cpuset & cpuacct controllers.
> 
> This means we no longer have any guarantee about which resource
> controllers the QEMU processes in general are in. All we can say is
> that they are in the 'systemd' controller, and the cpu, cpuset &
> cpuacct controllers. Systemd may have also placed it into *all* the
> other resource controllers, or it may have placed it into none of the
> them.
> 
> As such, after this change, we are potentially not in the correct
> cgroup for the memory, blkio, netcls, devices controllers.
> 
> For added fun, the qemuSetupCgroupForEmulator() still has a call to
> virCgroupMoveTask, so the addition of virCgroupAddTask to this method
> was useless.
> 
> 
> The final patch
> 
>   commit 90b721e43ec9232b5b218e891437bed04548e841
>   Author: Henning Schild <henning.schild at siemens.com>
>   Date:   Mon Dec 14 15:58:05 2015 -0500
> 
>     qemu cgroups: move new threads to new cgroup after cpuset is set
> up
> 
> This delays the point at which we call virCgroupAddTask for the
> vCPU and I/O thread cgroups, until after we have configured properties
> on those cgroups.  This change is fine
> 
> 
> So I think we need to revert the first 2 of Hennings patches - the 2nd
> one is the real serious problem, but once we revert that, the 1st
> change becomes pointless and so should also be reverted.
> 
> 
> Going back to the key problem Henning was trying to address....
> 
> The issue is that we only setup the emulator cgroup /after/ QEMU has
> already started running, so there's a period of time where threads are
> not confined by the cgroup affinity.
> 
> I think we can solve that more simply by just moving the call to
> qemuSetupCgroupForEmulator(), into qemuSetupCgroup(). That way we
> place the emulator thread into the correct cgroup *before* we even
> exec QEMU. So we never have the window where we run in the unconfined
> cpuset controller.

Thanks for that very thorough analysis and description, very helpful
indeed! I fully agree with your conclusion and suggestion on how the
code should be changed.

Henning




More information about the libvir-list mailing list