[libvirt] [PATCH 4/9] util: cgroups do not implicitly add task to new machine cgroup
Daniel P. Berrange
berrange at redhat.com
Fri Feb 26 12:21:02 UTC 2016
On Fri, Feb 26, 2016 at 01:16:15PM +0100, Henning Schild wrote:
> On Fri, 26 Feb 2016 11:13:13 +0000
> "Daniel P. Berrange" <berrange at redhat.com> wrote:
>
> > On Tue, Feb 23, 2016 at 04:58:39PM +0100, Henning Schild wrote:
> > > virCgroupNewMachine used to add the pidleader to the newly created
> > > machine cgroup. Do not do this implicit anymore.
> > >
> > > Signed-off-by: Henning Schild <henning.schild at siemens.com>
> > > ---
> > > src/lxc/lxc_cgroup.c | 11 +++++++++++
> > > src/qemu/qemu_cgroup.c | 11 +++++++++++
> > > src/util/vircgroup.c | 22 ----------------------
> > > 3 files changed, 22 insertions(+), 22 deletions(-)
> >
> > NACK to this patch once again.
> >
> > This does not actually work as you think it does.
> >
> > > diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c
> > > index 11f33ab..aef8e8c 100644
> > > --- a/src/util/vircgroup.c
> > > +++ b/src/util/vircgroup.c
> > > @@ -1682,16 +1682,6 @@ virCgroupNewMachineSystemd(const char *name,
> > > }
> > > }
> > >
> > > - if (virCgroupAddTask(*group, pidleader) < 0) {
> > > - virErrorPtr saved = virSaveLastError();
> > > - virCgroupRemove(*group);
> > > - virCgroupFree(group);
> > > - if (saved) {
> > > - virSetError(saved);
> > > - virFreeError(saved);
> > > - }
> > > - }
> >
> > Just above this we called virSystemdCreateMachine. Systemd will
> > create the cgroup and add the pidleader to those cgroups. Systemd
> > may add the pidleader to just the 'systemd' controller, or it may
> > add the pidleader to *ALL* controllers. We have no way of knowing.
> >
> > This virCgroupAddTask call deals with whatever systemd chose not
> > todo, so we can guarantee consistent behaviour with the pidleader
> > in all cgroups.
> >
> > By removing this you make this method non-deterministic - the pid
> > may or may not be in the cpu controller now. THis is bad because
> > it can lead to QEMU/LXC driver code working in some cases but
> > failing in other cases.
> >
> > Furthermore, this existing does not cause any problems for the
> > scenario you care about. THis cgroup placement is being set
> > in between the time libvirtd calls fork() and exec(). With your
> > later patch 5, we ensure that the PID is moved across into the
> > emulator cgroup, before we call exec(). When we call exec all
> > memory mappings will be replaced, so QEMU will stil start with
> > the correct vCPU placement and memory allocation placement.
>
> I agree having the task in the wrong cgroup before the exec() seems
> harmless. But i am not sure all the fiddling with cgroups is indeed
> harmless and does not cause i.e. kernel work on cores that should be
> left alone. I have the feeling allowing the task in the parent cgroup
> is a bad idea, no matter how short the window seems to be.
>
> Right now the parent cgroup contains all cpus found in machine.slice,
> which for pinned VMs is too much. How about we calculate the size of the
> child cgroups before and make the parent the union of them. Or give the
> parent the emulator pinning and extend it for the vcpus later.
> But that might turn out pretty complicated as well, getting the order
> right with the mix of cpusets and sched_setaffinity().
>
> > Just just drop this patch please.
The point is though that we have *no* choice. Systemd can put the task
in the cpu controller and we've no way to prevent that. So the code *has*
to be able to cope with that happening. Therefore this patch is wrong
it just makes behaviour non-deterministic increasing the chances that
we don't correctly handle the case where systemd adds the task to the
cpu controllers
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
More information about the libvir-list
mailing list