[libvirt] [PATCH 4/9] util: cgroups do not implicitly add task to new machine cgroup

Henning Schild henning.schild at siemens.com
Fri Feb 26 13:17:35 UTC 2016


On Fri, 26 Feb 2016 13:00:04 +0000
"Daniel P. Berrange" <berrange at redhat.com> wrote:

> On Fri, Feb 26, 2016 at 01:43:05PM +0100, Henning Schild wrote:
> > On Fri, 26 Feb 2016 12:21:02 +0000
> > "Daniel P. Berrange" <berrange at redhat.com> wrote:
> >   
> > > On Fri, Feb 26, 2016 at 01:16:15PM +0100, Henning Schild wrote:  
> > > > On Fri, 26 Feb 2016 11:13:13 +0000
> > > > "Daniel P. Berrange" <berrange at redhat.com> wrote:
> > > >     
> > > > > On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild
> > > > > wrote:    
> > > > > > virCgroupNewMachine used to add the pidleader to the newly
> > > > > > created machine cgroup. Do not do this implicit anymore.
> > > > > > 
> > > > > > Signed-off-by: Henning Schild <henning.schild at siemens.com>
> > > > > > ---
> > > > > >  src/lxc/lxc_cgroup.c   | 11 +++++++++++
> > > > > >  src/qemu/qemu_cgroup.c | 11 +++++++++++
> > > > > >  src/util/vircgroup.c   | 22 ----------------------
> > > > > >  3 files changed, 22 insertions(+), 22 deletions(-)      
> > > > > 
> > > > > NACK to this patch once again.
> > > > > 
> > > > > This does not actually work as you think it does.
> > > > >     
> > > > > > diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c
> > > > > > index 11f33ab..aef8e8c 100644
> > > > > > --- a/src/util/vircgroup.c
> > > > > > +++ b/src/util/vircgroup.c
> > > > > > @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char
> > > > > > *name, }
> > > > > >      }
> > > > > >  
> > > > > > -    if (virCgroupAddTask(*group, pidleader) < 0) {
> > > > > > -        virErrorPtr saved = virSaveLastError();
> > > > > > -        virCgroupRemove(*group);
> > > > > > -        virCgroupFree(group);
> > > > > > -        if (saved) {
> > > > > > -            virSetError(saved);
> > > > > > -            virFreeError(saved);
> > > > > > -        }
> > > > > > -    }      
> > > > > 
> > > > > Just above this we called virSystemdCreateMachine.  Systemd
> > > > > will create the cgroup and add the pidleader to those
> > > > > cgroups. Systemd may add the pidleader to just the 'systemd'
> > > > > controller, or it may add the pidleader to *ALL* controllers.
> > > > > We have no way of knowing.
> > > > > 
> > > > > This virCgroupAddTask call deals with whatever systemd chose
> > > > > not todo, so we can guarantee consistent behaviour with the
> > > > > pidleader in all cgroups.
> > > > > 
> > > > > By removing this you make this method non-deterministic - the
> > > > > pid may or may not be in the cpu controller now. THis is bad
> > > > > because it can lead to QEMU/LXC driver code working in some
> > > > > cases but failing in other cases.
> > > > > 
> > > > > Furthermore, this existing does not cause any problems for the
> > > > > scenario you care about. THis cgroup placement is being set
> > > > > in between the time libvirtd calls fork() and exec(). With
> > > > > your later patch 5, we ensure that the PID is moved across
> > > > > into the emulator cgroup, before we call exec(). When we call
> > > > > exec all memory mappings will be replaced, so QEMU will stil
> > > > > start with the correct vCPU placement and memory allocation
> > > > > placement.    
> > > > 
> > > > I agree having the task in the wrong cgroup before the exec()
> > > > seems harmless. But i am not sure all the fiddling with cgroups
> > > > is indeed harmless and does not cause i.e. kernel work on cores
> > > > that should be left alone. I have the feeling allowing the task
> > > > in the parent cgroup is a bad idea, no matter how short the
> > > > window seems to be.
> > > > 
> > > > Right now the parent cgroup contains all cpus found in
> > > > machine.slice, which for pinned VMs is too much. How about we
> > > > calculate the size of the child cgroups before and make the
> > > > parent the union of them. Or give the parent the emulator
> > > > pinning and extend it for the vcpus later. But that might turn
> > > > out pretty complicated as well, getting the order right with
> > > > the mix of cpusets and sched_setaffinity().   
> > > > > Just just drop this patch please.    
> > > 
> > > The point is though that we have *no* choice. Systemd can put the
> > > task in the cpu controller and we've no way to prevent that. So
> > > the code *has* to be able to cope with that happening. Therefore
> > > this patch is wrong it just makes behaviour non-deterministic
> > > increasing the chances that we don't correctly handle the case
> > > where systemd adds the task to the cpu controllers  
> > 
> > Understood! I was suggesting a "growing on demand" policy instead of
> > "shrinking after inheriting all".
> > 
> > If we can not control what systemd does we have to give it harmless
> > cpus to mess around with. That assumes we can control the size of
> > the cpuset before systemd puts anything in. Once back in control we
> > grow the parent group before deriving more child groups.
> > 
> > Would that be possible?
> > 
> > I have no objections to keep using the shrinking approach.
> > Especially since the controlled growing is harder to implement in
> > the given codebase. It just feels like it should be the other way
> > around.  
> 
> I don't see what actual problem you are trying to solve here.
> 
> 
> IIUC, the original problem you wanted to address was that vCPU pids
> could run the wrong CPU for a short time. ie The original code did
> 
>   1. Libvirtd forks child pid
>      ... this pid runs on whatever pCPUs that libvirt is permitted to
> use... 2. Libvirtd creates root cgroup for VM
>      ... this pid runs on whatever pCPUs the root cgroup inherited...
>   3. Child pid execs QEMU
>      ... QEMU pid runs on whatever pCPUs the root cgroup inherited...
>   4. QEMU creates vCPU pids
>      ... vCPU pids run on whatever pCPUs the root cgroup inherited...
>   4. Libvirtd moves emulator PIDs and vCPU PIDs
>      ... emulator PIDs are running on assigned pCPUs for emulator...
>      ... vCPU PIDs are running on assigned pCPUs for vCPUs....
> 
> With the important change in patch 5 this now looks like
> 
>   1. Libvirtd forks child pid
>      ... this pid runs on whatever pCPUs that libvirt is permitted to
> use... 2. Libvirtd creates root cgroup for VM

>      ... this pid runs on whatever pCPUs the root cgroup inherited...

I am trying to come up with a solution that eliminates the above line
from the whole bringup. I.e never allow a pid belonging to the VM
outside the pinnings of libvirtd and the VM configuration.

But until step 4 it should probably be
"... this pid *just sits* on whatever pCPUs the root cgroup
inherited..."
If we are sure that it does not "run" before 4. patch 5 does the trick
already

>   3. Libvirtd moves pid into emulator group
>      ... this pid runs on assigned pCPUs for emulator...
>   4. Child pid execs QEMU
>      ... QEMU pid runs on assigned pCPUs for emulator...
>   5. QEMU creates vCPU pids
>      ... vCPU pids are running on assigned pCPUS for emulator...
>   6. Libvirtd moves vCPU PIDs
>      ... emulator PIDs are running on assigned pCPUs for emulator...
>      ... vCPU PIDs are running on assigned pCPUs for vCPUs....
> 
> Which is good, because vCPU pids don't ever run on un-restricted
> pCPUs.
> 
> 
> Your patch 4 here is attempting to change step 2 only so that it
> looks like
> 
> 
>   1. Libvirtd forks child pid
>      ... this pid runs on whatever pCPUs that libvirt is permitted to
> use... 2. Libvirtd creates root cgroup for VM
>      ... this pid runs on whatever pCPUs that libvirt is permitted to
> use... or depending on what controller system added
>      ... this pid runs on whatever pCPUs the root cgroup inherited...
>   3. Libvirtd adds pid into emulator group
>      ... this pid runs on assigned pCPUs for emulator...
>   4. Child pid execs QEMU
>      ... QEMU pid runs on assigned pCPUs for emulator...
>   5. QEMU creates vCPU pids
>      ... vCPU pids are running on assigned pCPUS for emulator...
>   6. Libvirtd moves vCPU PIDs
>      ... emulator PIDs are running on assigned pCPUs for emulator...
>      ... vCPU PIDs are running on assigned pCPUs for vCPUs....
> 
> At the time we exec QEMU in step 4 the situation is exactly the same
> as before. The vCPU pids are still created in the right place
> straight away.
> 
> So this patch 4 doesn't achieve anything useful
> 
> Regards,
> Daniel




More information about the libvir-list mailing list