[PATCH 1/1] vircgroup: Cleanup nested cgroups

Pavel Hrdina phrdina at redhat.com
Fri Apr 9 10:02:22 UTC 2021


On Thu, Apr 08, 2021 at 10:00:32PM -0400, Eric Farman wrote:
> 
> 
> On 4/8/21 8:12 AM, Pavel Hrdina wrote:
> > On Wed, Apr 07, 2021 at 11:28:48PM -0400, Eric Farman wrote:
> > > 
> > > 
> > > On 4/7/21 9:07 AM, Pavel Hrdina wrote:
> > > > On Fri, Mar 26, 2021 at 05:25:03PM +0100, Eric Farman wrote:
> > > > > The introduction of nested cgroups used a little macro
> > > > > virCgroupGetNested() to retrieve the nested cgroup
> > > > > pointer, if one exists. But this macro isn't used when
> > > > > removing cgroups, resulting in some messages:
> > > > > 
> > > > >     Mar 25 20:55:17 fedora33 libvirtd[955]: unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory
> > > > >     Mar 25 20:55:17 fedora33 libvirtd[955]: Failed to remove cgroup for guest
> > > > > 
> > > > > That directory exists while the guest is running, as it
> > > > > was created by systemd/machined, so the code probably meant
> > > > > to open the libvirt/ subdirectory from that point.
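> > > > > 
> > > > > For reference, the macro is roughly this (a sketch, not a verbatim
> > > > > copy of the definition):
> > > > > 
> > > > >     /* return the nested cgroup if one exists, else the group itself */
> > > > >     #define virCgroupGetNested(cgroup) \
> > > > >         (cgroup->nested ? cgroup->nested : cgroup)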
> > > > > 
> > > > > Similarly, there happen to be BPF-related file descriptors
> > > > > that don't get cleaned up in this process too, because they
> > > > > are anchored off the nested cgroup location:
> > > > > 
> > > > >     [test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l
> > > > >     35
> > > > >     [test@fedora33 ~]# virsh create guest.xml
> > > > >     Domain 'guest' created from guest.xml
> > > > > 
> > > > >     [test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l
> > > > >     42
> > > > >     [test@fedora33 ~]# virsh shutdown guest
> > > > >     Domain 'guest' is being shutdown
> > > > > 
> > > > >     [test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l
> > > > >     37
> > > > >     [test@fedora33 ~]# virsh create guest.xml
> > > > >     Domain 'guest' created from guest.xml
> > > > > 
> > > > >     [test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l
> > > > >     44
> > > > >     [test@fedora33 ~]# virsh shutdown guest
> > > > >     Domain 'guest' is being shutdown
> > > > > 
> > > > >     [test@fedora33 ~]# ls /proc/$(pgrep libvirtd)/fd/* | wc -l
> > > > >     39
> > > > > 
> > > > > Let's fix this by using the same macro when removing cgroups,
> > > > > so that it picks up the right structure and can remove the
> > > > > associated resources properly.
> > > > > 
> > > > > Fixes: 184245f53b94 ("vircgroup: introduce nested cgroup to properly work with systemd")
> > > > > Signed-off-by: Eric Farman <farman at linux.ibm.com>
> > > > > ---
> > > > >    src/util/vircgroup.c | 5 +++--
> > > > >    1 file changed, 3 insertions(+), 2 deletions(-)
> > > > 
> > > > I don't think this patch is correct. With systemd we would get the
> > > > same error even without the nested cgroup, because we terminate the
> > > > qemu process, which makes systemd remove the VM root cgroup as well.
> > > 
> > > I don't experience any problems reverting the blamed patch. The qemu
> > > cgroup code doesn't distinguish between systemd and non-systemd hosts;
> > > it just calls virCgroupRemove() to clean up the resources that were set
> > > up during init:
> > > 
> > > qemuInitCgroup()
> > >    virCgroupNewMachine()
> > >      virCgroupNewMachineSystemd()
> > >        virCgroupNewNested()
> > > 
> > > The group pointer that's stashed in qemu's struct is that of the
> > > machine-qemu...scope group, rather than the nested group, but nothing in the
> > > cleanup path touches group->nested. My initial patch is certainly flawed (as
> > > you explain below), so maybe something like this is better?
> > > 
> > > @@ -2615,6 +2615,9 @@ virCgroupRemove(virCgroupPtr group)
> > >   {
> > >       size_t i;
> > > 
> > > +    if (group->nested)
> > > +       virCgroupRemove(group->nested);
> > > +
> > >       for (i = 0; i < VIR_CGROUP_BACKEND_TYPE_LAST; i++) {
> > >           if (group->backends[i]) {
> > >               int rc = group->backends[i]->remove(group);
> > > 
> > > Not great, since it cleans up the nested group but then still attempts
> > > to clean up the machine-qemu...scope group that was set up by systemd.
> > > This group wasn't set up by virCgroupV2MakeGroup(), so calling
> > > virCgroupV2Remove() seems wrong too. Not sure how to address that.
> > 
> > I'm not sure how this will help. As I've already pointed out, calling
> > virCgroupRemove(group) results in calling one or both of these functions:
> > 
> >      virCgroupV1Remove()
> >      virCgroupV2Remove()
> > 
> > Both of these functions will call virCgroupRemoveRecursively(), which
> > recursively removes all subdirectories, including the nested one.
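> > 
> > For context, the recursive removal amounts to something like this plain
> > POSIX sketch (not libvirt's actual code; cgroup directories contain only
> > kernel-owned files, so it is enough to rmdir() the child directories and
> > then the group itself):
> > 
> >     #include <dirent.h>
> >     #include <errno.h>
> >     #include <stdio.h>
> >     #include <string.h>
> >     #include <unistd.h>
> > 
> >     static int
> >     remove_cgroup_recursively(const char *grppath)
> >     {
> >         DIR *dir = opendir(grppath);
> >         struct dirent *ent;
> > 
> >         if (!dir)
> >             return errno == ENOENT ? 0 : -1;  /* already gone is fine */
> > 
> >         while ((ent = readdir(dir)) != NULL) {
> >             char child[4096];
> > 
> >             /* only subdirectories (child cgroups) need removing */
> >             if (ent->d_type != DT_DIR ||
> >                 strcmp(ent->d_name, ".") == 0 ||
> >                 strcmp(ent->d_name, "..") == 0)
> >                 continue;
> > 
> >             snprintf(child, sizeof(child), "%s/%s", grppath, ent->d_name);
> >             if (remove_cgroup_recursively(child) < 0) {
> >                 closedir(dir);
> >                 return -1;
> >             }
> >         }
> >         closedir(dir);
> > 
> >         /* children are gone by now, so the group itself can go last */
> >         return (rmdir(grppath) == 0 || errno == ENOENT) ? 0 : -1;
> >     }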
> > 
> > So the extra
> > 
> >      if (group->nested)
> >          virCgroupRemove(group->nested);
> > 
> > will not help with anything.
> > 
> > Looking at the code (I did not test it), the error message appears to
> > be produced by following this path:
> > 
> > qemuProcessStop()
> >    qemuRemoveCgroup()
> >      virCgroupTerminateMachine() this will make systemd to remove the cgroup
> >      virCgroupRemove()
> >        virCgroupV2Remove()
> >          virCgroupV2DevicesRemoveProg()
> >            virCgroupV2DevicesDetectProg()
> >              open()
> 
> Yes, this is the path where we exit out, and thus never get to the
> virCgroupRemoveRecursively() routine you mentioned above. But it's not
> because of the virCgroupTerminateMachine() call; rather, it's the progfd
> and mapfd fields in the cgroup passed to virCgroupV2DevicesDetectProg().
> 
> With the blamed patch, these fields are zero, so we go ahead and try to do
> all that other work. Prior to that patch, and with my proposed patch, these
> fds are non-zero, so it exits immediately back to
> virCgroupV2DevicesRemoveProg(), which does a VIR_FORCE_CLOSE on them. The
> non-zero fds are related to BPF (see below), and are stashed in the nested
> cgroup nowadays.
> 
> (virsh create/shutdown a guest three times)
> # readlink /proc/$(pgrep libvirtd)/fd/* | grep bpf
> anon_inode:bpf-map
> anon_inode:bpf-map
> anon_inode:bpf-prog
> anon_inode:bpf-map
> anon_inode:bpf-prog
> anon_inode:bpf-prog
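> 
> In other words, the flow is roughly this (a sketch; the exact field names
> in the real code may differ):
> 
>     /* virCgroupV2DevicesRemoveProg(), simplified: (re)detect the BPF
>      * prog/map fds, then drop our references to them */
>     if (virCgroupV2DevicesDetectProg(group) < 0)
>         return -1;
> 
>     VIR_FORCE_CLOSE(group->unified.devices.progfd);
>     VIR_FORCE_CLOSE(group->unified.devices.mapfd);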

I still fail to see how calling virCgroupRemove(group->nested) in
virCgroupRemove() would help at all. The original issue you mentioned in
the commit message is that we log this error:

    unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory

So calling virCgroupRemove(group->nested) would also fail with:

    unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/libvirt/': No such file or directory

because if the VM root cgroup ('machine-qemu\x2d1\x2dguest.scope')
doesn't exist, the nested cgroup ('libvirt') will not exist either.

As for the progfd and mapfd being zero, that depends on whether libvirtd
was restarted while the VM was running. When a VM is started on a host
with cgroups v2, libvirt creates a BPF program and a BPF map to limit
access to system devices, and stores both in the mentioned variables.
But if you restart libvirtd, it will load the BPF prog ID and BPF map ID
only on demand, for example when destroying the VM.
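
For illustration, the entry of the detection code behaves roughly like
this (a sketch; the exact field names and helpers may differ):

    /* virCgroupV2DevicesDetectProg(), simplified: if we already hold the
     * fds (libvirtd was not restarted), there is nothing to detect; after
     * a restart both are 0 and we have to query the kernel via bpf() */
    if (group->unified.devices.progfd > 0 && group->unified.devices.mapfd > 0)
        return 0;

    if ((cgroupfd = open(grppath, O_RDONLY)) < 0) {
        virReportSystemError(errno, _("unable to open '%s'"), grppath);
        return -1;
    }
    /* ... BPF_PROG_QUERY on cgroupfd to find the attached program ... */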

On hosts with systemd, the owner of the VM root cgroup is systemd, and
because we talk to systemd-machined and register the VM there, the VM
root cgroup is created for us by systemd-machined and is associated with
the qemu PID. When destroying the VM we kill the qemu process, which
triggers systemd-machined to automatically clean up the cgroup. Once that
happens, the kernel should eventually clean up both the BPF prog and the
BPF map that were associated with the nested cgroup, because it no longer
exists and there are no other references to the prog or map. It may take
some time before the kernel actually removes them.

The only thing we can do on a systemd-based host is to skip the whole BPF
detection code if the directory was already removed, which I've already
proposed: handling ENOENT in virCgroupV2DevicesDetectProg() (see the
sketch at the end of this mail).

All of the above applies to systemd-based hosts. If we consider a system
without systemd, then there is an actual bug, as you already pointed out:
the BPF prog and BPF map are now associated with the nested cgroup. To
fix that we should change only virCgroupV2Remove():

diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c
index 248d4047e5..4664492c34 100644
--- a/src/util/vircgroupv2.c
+++ b/src/util/vircgroupv2.c
@@ -523,6 +523,7 @@ static int
 virCgroupV2Remove(virCgroupPtr group)
 {
     g_autofree char *grppath = NULL;
+    virCgroupPtr parent = virCgroupGetNested(group);
     int controller;

     /* Don't delete the root group, if we accidentally
@@ -534,7 +535,7 @@ virCgroupV2Remove(virCgroupPtr group)
     if (virCgroupV2PathOfController(group, controller, "", &grppath) < 0)
         return 0;

-    if (virCgroupV2DevicesRemoveProg(group) < 0)
+    if (virCgroupV2DevicesRemoveProg(parent) < 0)
         return -1;

     return virCgroupRemoveRecursively(grppath);


> > Unfortunately we cannot simply ignore ENOENT in
> > virCgroupV2DevicesDetectProg() because it is used in different places
> > where such an error should be reported.
> > 
> > What we can do is introduce a new parameter `bool quiet` into
> > virCgroupV2DevicesDetectProg() that would silence the ENOENT error if
> > set to true; we could use it from virCgroupV2DevicesRemoveProg() or
> > something similar.
> 
> This smells fishy to me. I tried it, and it does indeed get us to
> virCgroupRemoveRecursively() later down that call chain. The downside is
> that it is of course called with the same cgroup
> (grppath="/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/")
> and returns without doing anything when virDirOpenQuiet() also returns
> ENOENT.

Well that's the only thing we can do if the directories no longer exist.
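
For illustration, the shape of that change, demonstrated with plain open()
instead of the real libvirt helpers (an untested sketch; the names are
illustrative, not the actual signatures):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Detection helper: ENOENT stays an error for normal callers, but
     * the removal path passes quiet=true, because a vanished cgroup
     * simply means there is nothing left to clean up. */
    static int
    detect_prog(const char *path, bool quiet)
    {
        int fd = open(path, O_RDONLY);

        if (fd < 0) {
            if (quiet && errno == ENOENT)
                return 0;      /* silently treat as "nothing to detect" */
            perror(path);      /* all other callers still see the error */
            return -1;
        }
        /* ... query the attached BPF program here ... */
        close(fd);
        return 0;
    }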

Pavel

> Eric
> 
> > 
> > Pavel
> > 
> > > > 
> > > > This happens only with cgroup controllers managed by systemd. For
> > > > example, if you switch to cgroups v1, where each controller lives in
> > > > a separate directory, not all controllers supported by libvirt are
> > > > also supported by systemd. In that case libvirt creates all the
> > > > cgroups by itself and is responsible for the cleanup as well. With
> > > > this patch we would not remove the VM root cgroups for these
> > > > controllers. This would affect the following controllers:
> > > > 
> > > >       cpuset
> > > >       freezer
> > > >       net_cls
> > > >       net_prio
> > > >       perf_event
> > > > 
> > > > You can verify what happens with cgroups v1 by adding
> > > > systemd.unified_cgroup_hierarchy=0 to your kernel cmdline.
> > > 
> > > Neat, thanks for that tip.
> > > 
> > > Thanks,
> > > Eric
> > > 
> > > > 
> > > > Pavel
> > > > 
> > > 
> 