[libvirt] [PATCH 4/9] util: cgroups do not implicitly add task to new machine cgroup
Henning Schild
henning.schild at siemens.com
Fri Feb 26 13:17:35 UTC 2016
On Fri, 26 Feb 2016 13:00:04 +0000
"Daniel P. Berrange" <berrange at redhat.com> wrote:
> On Fri, Feb 26, 2016 at 01:43:05PM +0100, Henning Schild wrote:
> > On Fri, 26 Feb 2016 12:21:02 +0000
> > "Daniel P. Berrange" <berrange at redhat.com> wrote:
> >
> > > On Fri, Feb 26, 2016 at 01:16:15PM +0100, Henning Schild wrote:
> > > > On Fri, 26 Feb 2016 11:13:13 +0000
> > > > "Daniel P. Berrange" <berrange at redhat.com> wrote:
> > > >
> > > > > On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild
> > > > > wrote:
> > > > > > virCgroupNewMachine used to add the pidleader to the newly
> > > > > > created machine cgroup. Do not do this implicit anymore.
> > > > > >
> > > > > > Signed-off-by: Henning Schild <henning.schild at siemens.com>
> > > > > > ---
> > > > > > src/lxc/lxc_cgroup.c | 11 +++++++++++
> > > > > > src/qemu/qemu_cgroup.c | 11 +++++++++++
> > > > > > src/util/vircgroup.c | 22 ----------------------
> > > > > > 3 files changed, 22 insertions(+), 22 deletions(-)
> > > > >
> > > > > NACK to this patch once again.
> > > > >
> > > > > This does not actually work as you think it does.
> > > > >
> > > > > > diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c
> > > > > > index 11f33ab..aef8e8c 100644
> > > > > > --- a/src/util/vircgroup.c
> > > > > > +++ b/src/util/vircgroup.c
> > > > > > @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char
> > > > > > *name, }
> > > > > > }
> > > > > >
> > > > > > - if (virCgroupAddTask(*group, pidleader) < 0) {
> > > > > > - virErrorPtr saved = virSaveLastError();
> > > > > > - virCgroupRemove(*group);
> > > > > > - virCgroupFree(group);
> > > > > > - if (saved) {
> > > > > > - virSetError(saved);
> > > > > > - virFreeError(saved);
> > > > > > - }
> > > > > > - }
> > > > >
> > > > > Just above this we called virSystemdCreateMachine. Systemd
> > > > > will create the cgroup and add the pidleader to those
> > > > > cgroups. Systemd may add the pidleader to just the 'systemd'
> > > > > controller, or it may add the pidleader to *ALL* controllers.
> > > > > We have no way of knowing.
> > > > >
> > > > > This virCgroupAddTask call deals with whatever systemd chose
> > > > > not todo, so we can guarantee consistent behaviour with the
> > > > > pidleader in all cgroups.
> > > > >
> > > > > By removing this you make this method non-deterministic - the
> > > > > pid may or may not be in the cpu controller now. THis is bad
> > > > > because it can lead to QEMU/LXC driver code working in some
> > > > > cases but failing in other cases.
> > > > >
> > > > > Furthermore, this existing does not cause any problems for the
> > > > > scenario you care about. THis cgroup placement is being set
> > > > > in between the time libvirtd calls fork() and exec(). With
> > > > > your later patch 5, we ensure that the PID is moved across
> > > > > into the emulator cgroup, before we call exec(). When we call
> > > > > exec all memory mappings will be replaced, so QEMU will stil
> > > > > start with the correct vCPU placement and memory allocation
> > > > > placement.
> > > >
> > > > I agree having the task in the wrong cgroup before the exec()
> > > > seems harmless. But i am not sure all the fiddling with cgroups
> > > > is indeed harmless and does not cause i.e. kernel work on cores
> > > > that should be left alone. I have the feeling allowing the task
> > > > in the parent cgroup is a bad idea, no matter how short the
> > > > window seems to be.
> > > >
> > > > Right now the parent cgroup contains all cpus found in
> > > > machine.slice, which for pinned VMs is too much. How about we
> > > > calculate the size of the child cgroups before and make the
> > > > parent the union of them. Or give the parent the emulator
> > > > pinning and extend it for the vcpus later. But that might turn
> > > > out pretty complicated as well, getting the order right with
> > > > the mix of cpusets and sched_setaffinity().
> > > > > Just just drop this patch please.
> > >
> > > The point is though that we have *no* choice. Systemd can put the
> > > task in the cpu controller and we've no way to prevent that. So
> > > the code *has* to be able to cope with that happening. Therefore
> > > this patch is wrong it just makes behaviour non-deterministic
> > > increasing the chances that we don't correctly handle the case
> > > where systemd adds the task to the cpu controllers
> >
> > Understood! I was suggesting a "growing on demand" policy instead of
> > "shrinking after inheriting all".
> >
> > If we can not control what systemd does we have to give it harmless
> > cpus to mess around with. That assumes we can control the size of
> > the cpuset before systemd puts anything in. Once back in control we
> > grow the parent group before deriving more child groups.
> >
> > Would that be possible?
> >
> > I have no objections to keep using the shrinking approach.
> > Especially since the controlled growing is harder to implement in
> > the given codebase. It just feels like it should be the other way
> > around.
>
> I don't see what actual problem you are trying to solve here.
>
>
> IIUC, the original problem you wanted to address was that vCPU pids
> could run the wrong CPU for a short time. ie The original code did
>
> 1. Libvirtd forks child pid
> ... this pid runs on whatever pCPUs that libvirt is permitted to
> use... 2. Libvirtd creates root cgroup for VM
> ... this pid runs on whatever pCPUs the root cgroup inherited...
> 3. Child pid execs QEMU
> ... QEMU pid runs on whatever pCPUs the root cgroup inherited...
> 4. QEMU creates vCPU pids
> ... vCPU pids run on whatever pCPUs the root cgroup inherited...
> 4. Libvirtd moves emulator PIDs and vCPU PIDs
> ... emulator PIDs are running on assigned pCPUs for emulator...
> ... vCPU PIDs are running on assigned pCPUs for vCPUs....
>
> With the important change in patch 5 this now looks like
>
> 1. Libvirtd forks child pid
> ... this pid runs on whatever pCPUs that libvirt is permitted to
> use... 2. Libvirtd creates root cgroup for VM
> ... this pid runs on whatever pCPUs the root cgroup inherited...
I am trying to come up with a solution that eliminates the above line
from the whole bringup. I.e never allow a pid belonging to the VM
outside the pinnings of libvirtd and the VM configuration.
But until step 4 it should probably be
"... this pid *just sits* on whatever pCPUs the root cgroup
inherited..."
If we are sure that it does not "run" before 4. patch 5 does the trick
already
> 3. Libvirtd moves pid into emulator group
> ... this pid runs on assigned pCPUs for emulator...
> 4. Child pid execs QEMU
> ... QEMU pid runs on assigned pCPUs for emulator...
> 5. QEMU creates vCPU pids
> ... vCPU pids are running on assigned pCPUS for emulator...
> 6. Libvirtd moves vCPU PIDs
> ... emulator PIDs are running on assigned pCPUs for emulator...
> ... vCPU PIDs are running on assigned pCPUs for vCPUs....
>
> Which is good, because vCPU pids don't ever run on un-restricted
> pCPUs.
>
>
> Your patch 4 here is attempting to change step 2 only so that it
> looks like
>
>
> 1. Libvirtd forks child pid
> ... this pid runs on whatever pCPUs that libvirt is permitted to
> use... 2. Libvirtd creates root cgroup for VM
> ... this pid runs on whatever pCPUs that libvirt is permitted to
> use... or depending on what controller system added
> ... this pid runs on whatever pCPUs the root cgroup inherited...
> 3. Libvirtd adds pid into emulator group
> ... this pid runs on assigned pCPUs for emulator...
> 4. Child pid execs QEMU
> ... QEMU pid runs on assigned pCPUs for emulator...
> 5. QEMU creates vCPU pids
> ... vCPU pids are running on assigned pCPUS for emulator...
> 6. Libvirtd moves vCPU PIDs
> ... emulator PIDs are running on assigned pCPUs for emulator...
> ... vCPU PIDs are running on assigned pCPUs for vCPUs....
>
> At the time we exec QEMU in step 4 the situation is exactly the same
> as before. The vCPU pids are still created in the right place
> straight away.
>
> So this patch 4 doesn't achieve anything useful
>
> Regards,
> Daniel
More information about the libvir-list
mailing list