[PATCH 1/1] vircgroup: Cleanup nested cgroups

Eric Farman farman at linux.ibm.com
Fri Apr 9 17:32:03 UTC 2021



On 4/9/21 10:14 AM, Pavel Hrdina wrote:
> On Fri, Apr 09, 2021 at 09:13:32AM -0400, Eric Farman wrote:
>>> I still fail to see how calling virCgroupRemove(group->nested) in
>>> virCgroupRemove() would help at all. The original issue you mentioned in
>>> the commit message is that we log this error:
>>>
>>>       unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/': No such file or directory
>>>
>>> So calling virCgroupRemove(group->nested) would also fail with:
>>>
>>>       unable to open '/sys/fs/cgroup/machine.slice/machine-qemu\x2d1\x2dguest.scope/libvirt/': No such file or directory
>>>
>>> because if the VM root cgroup ('machine-qemu\x2d1\x2dguest.scope')
>>> doesn't exist, the nested cgroup ('libvirt') will not exist either.
>>
>> While you are correct that the nested cgroup doesn't exist in sysfs, the
>> pointers being handled by libvirt still do since virCgroupFree() hasn't yet
>> been called. This message shows up because the cgroup we are processing (the
>> VM root cgroup) contains zeros for progfd and mapfd, and
>> virCgroupV2DevicesDetectProg() attempts to open the corresponding path that
>> systemd already cleaned up.
>>
>> The nested cgroup ('libvirt') has nonzero file descriptors, so the check at
>> the top of virCgroupV2DevicesDetectProg() would return before attempting to
>> open the nonexistent '/sys/fs/cgroup/.../libvirt/' path. In that case, the
>> message would NOT be generated, and the code would simply return to
>> virCgroupV2DevicesRemoveProg() to close the file descriptors.
> 
> That's one bug: when calling virCgroupV2DevicesRemoveProg() we are
> passing the root cgroup, but we need to pass the nested cgroup instead.
> This will fix the FD leak when VMs are started and destroyed while
> libvirt is running the whole time.
> 
>>> Now to the progfd and mapfd being zero, that depends whether libvirtd
>>> was restarted while the VM was running.
>>
>> Huh? No. With your patch I see the progfd and mapfd fields are zero in
>> the VM root cgroup, while the BPF file descriptors are stored in the
>> nested ('libvirt') cgroup. Previously, they were in the VM root cgroup
>> itself.
>>
>>> When a VM is started on a host
>>> with cgroups v2, libvirt will create a BPF program and BPF map to limit
>>> access to system devices and store both in the mentioned variables.
>>> But if you restart libvirtd it will load the BPF prog ID and BPF map ID
>>> only on demand, for example when destroying the VM.
>>>
>>> Now on hosts with systemd the owner of the VM root cgroup is systemd
>>> and because we talk to systemd-machined and register the VM there, the
>>> VM root cgroup is created for us by systemd-machined and it is
>>> associated with the qemu PID. When destroying the VM we will kill the
>>> qemu process, which will trigger systemd-machined to automatically
>>> clean up the cgroup. Once that happens the kernel should eventually
>>> clean up both the BPF prog and BPF map that were associated with the
>>> nested cgroup, because it no longer exists and there are no other
>>> references to the prog or map. It may take some time before the kernel
>>> actually removes them.
>>
>> This all may be true, but as I said in my commit message, libvirt leaves
>> two file descriptors open for the BPF prog and map when the guest is
>> shut down. I was attempting to trigger a different bug by
>> doing a virsh create/shutdown in a loop, and eventually my creates would
>> fail because I'd exhausted the number of open files.
>>
>>> The only thing we can do on a systemd-based host is to skip the whole
>>> BPF detection code if the directory was already removed, which I've
>>> already proposed by handling ENOENT in virCgroupV2DevicesDetectProg().
>>
>> Yes. That removes the message, but doesn't clean up the file descriptors
>> being left open.
> 
> Correct. It was not obvious to me from the original commit message that
> we are leaking file descriptors, because I was more focused on the error
> itself and understood that you were trying to fix the code to stop
> printing that error.

I'm glad I mentioned both, though it wasn't obvious to me there was a 
relation between the two until I started looking into the file 
descriptor problem.
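
To spell out the connection for the archives: the detection code returns 
early when it already holds the BPF file descriptors, and only goes poking 
at sysfs when they are zero. A stripped-down sketch of that shape (made-up 
names and types, not the actual libvirt code):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-cgroup devices state. */
struct devices_state {
    int progfd;   /* BPF program FD, 0 when not yet detected */
    int mapfd;    /* BPF map FD, 0 when not yet detected */
};

static int
detect_prog(struct devices_state *state, const char *cgroup_path)
{
    int dirfd;

    /* The guard: with nonzero FDs there is nothing left to detect. */
    if (state->progfd > 0 && state->mapfd > 0)
        return 0;

    dirfd = open(cgroup_path, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) {
        /* With zeroed FDs we end up here, and if systemd already removed
         * the cgroup this is where the "No such file or directory"
         * message comes from. */
        fprintf(stderr, "unable to open '%s': %s\n",
                cgroup_path, strerror(errno));
        return -1;
    }

    /* ... the real BPF prog/map detection would happen here ... */
    close(dirfd);
    return 0;
}

int main(void)
{
    struct devices_state vm_root = { 0, 0 };   /* zeroed, like the VM root cgroup */
    struct devices_state nested  = { 42, 43 }; /* nonzero, like the nested 'libvirt' cgroup */

    detect_prog(&vm_root, "/sys/fs/cgroup/machine.slice/gone.scope/");
    detect_prog(&nested,  "/sys/fs/cgroup/machine.slice/gone.scope/libvirt/");
    return 0;
}

With your patch the nonzero FDs live in the nested 'libvirt' group, so the 
VM root group always takes the open() path and trips over the directory 
systemd already removed, while the nested group returns at the guard.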

> 
>>> All of the above applies to systemd-based hosts. If we consider a
>>> system without systemd then there is an actual bug, as you already
>>> pointed out: the BPF prog and BPF map are now associated with the
>>> nested cgroup. To fix that we should change only virCgroupV2Remove:
>>>
>>> diff --git a/src/util/vircgroupv2.c b/src/util/vircgroupv2.c
>>> index 248d4047e5..4664492c34 100644
>>> --- a/src/util/vircgroupv2.c
>>> +++ b/src/util/vircgroupv2.c
>>> @@ -523,6 +523,7 @@ static int
>>>    virCgroupV2Remove(virCgroupPtr group)
>>>    {
>>>        g_autofree char *grppath = NULL;
>>> +    virCgroupPtr parent = virCgroupGetNested(group);
>>>        int controller;
>>>
>>>        /* Don't delete the root group, if we accidentally
>>> @@ -534,7 +535,7 @@ virCgroupV2Remove(virCgroupPtr group)
>>>        if (virCgroupV2PathOfController(group, controller, "", &grppath) < 0)
>>>            return 0;
>>>
>>> -    if (virCgroupV2DevicesRemoveProg(group) < 0)
>>> +    if (virCgroupV2DevicesRemoveProg(parent) < 0)
>>>            return -1;
>>>
>>>        return virCgroupRemoveRecursively(grppath);
>>>
>>>
>>
>> This addresses both symptoms I'm experiencing, but it feels odd for this
>> to be the only place outside of vircgroup.c that needs to understand the
>> difference between group and group->nested. This is why I'd proposed
>> putting that logic in the caller, so it would cover both v1 and v2, but
>> if only the v2 code needs this, then okay. I don't have a non-systemd
>> system nearby to try it with.
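
For what it's worth, my (possibly wrong) reading of virCgroupGetNested() is 
that it falls back to the group itself when there is no nested child, which 
would keep the change above safe either way. A toy model of the distinction 
(simplified structs, not the real virCgroup types):

#include <stdio.h>

/* Simplified stand-in for virCgroup; not the real definition. */
struct cgroup {
    const char *path;
    struct cgroup *nested;  /* the nested "libvirt" child, or NULL */
};

/* Toy model of "prefer the nested child, else the group itself". */
static struct cgroup *
get_nested(struct cgroup *group)
{
    return group->nested ? group->nested : group;
}

int main(void)
{
    struct cgroup nested = {
        "/sys/fs/cgroup/machine.slice/vm.scope/libvirt", NULL };
    struct cgroup vm_root = {
        "/sys/fs/cgroup/machine.slice/vm.scope", &nested };

    /* The BPF prog/map FDs now live in the nested group, so this is the
     * group whose descriptors have to be closed on removal. */
    printf("remove prog from: %s\n", get_nested(&vm_root)->path);
    return 0;
}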
> 
> We use BPF only with cgroups v2; cgroups v1 has a regular devices
> controller with sysfs files that we can read from and write to.
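
Understood. For comparison, I take it the v1 path really is just plain file 
I/O against the devices controller; a minimal sketch with a made-up cgroup 
path, nothing libvirt-specific:

#include <stdio.h>

int main(void)
{
    /* Hypothetical v1 location; the devices controller lives under
     * /sys/fs/cgroup/devices/ on a cgroups v1 host. */
    const char *path =
        "/sys/fs/cgroup/devices/machine.slice/vm.scope/devices.deny";
    FILE *fp = fopen(path, "w");

    if (!fp) {
        perror(path);
        return 1;
    }

    /* Writing "a" drops the default allow-all rule; individual devices
     * are then re-allowed through devices.allow. */
    if (fputs("a", fp) == EOF || fflush(fp) == EOF)
        perror("write");

    fclose(fp);
    return 0;
}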
> 
> I removed most of the previous quotes as they are not relevant now.
> 
> To summarize the current code has two issues:
> 
> 1) We are leaking BPF prog and BPF map file descriptors because they now
>     live in the nested "libvirt" cgroup instead of the VM root cgroup.
>     This can be addressed by the code above. This happens only if
>     libvirtd was not restarted between VM start and VM destroy.

Agreed.
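
In case anyone wants to reproduce this: the leak is easy to watch by 
counting libvirtd's open descriptors across a create/destroy loop. A 
throwaway helper, nothing libvirt-specific:

#include <dirent.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64];
    struct dirent *ent;
    DIR *dir;
    long count = 0;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);
    if (!(dir = opendir(path))) {
        perror(path);
        return 1;
    }

    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] != '.')
            count++;
    }
    closedir(dir);

    printf("%s: %ld open file descriptors\n", path, count);
    return 0;
}

Pointed at libvirtd's pid between iterations, the count climbs by two 
(prog + map) per VM lifecycle, which is how I ran into the open-files 
limit.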

> 
> 2) When running on systemd and libvirtd was restarted between VM start
>     and VM destroy, we will always attempt to detect the BPF prog and
>     BPF map in order to close them right away.

I'll trust you regarding this, because I haven't looked into what 
happens when libvirtd restarts.
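
If I understand the idea, the on-demand detection amounts to asking the 
kernel which device program is attached to the cgroup and re-acquiring FDs 
from the returned IDs. A rough sketch using libbpf for brevity (libvirt has 
its own wrappers around the bpf() syscall, I believe, so the real calls 
differ; build with -lbpf):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <linux/bpf.h>

int main(int argc, char **argv)
{
    /* Cgroup path is made up; pass the real one as argv[1]. */
    const char *cgpath = argc > 1 ? argv[1]
        : "/sys/fs/cgroup/machine.slice/vm.scope/libvirt/";
    __u32 prog_ids[8] = { 0 };
    __u32 prog_cnt = 8;
    int cgfd, progfd;

    cgfd = open(cgpath, O_RDONLY | O_DIRECTORY);
    if (cgfd < 0) {
        perror("open cgroup");  /* already gone if systemd cleaned it up */
        return 1;
    }

    if (bpf_prog_query(cgfd, BPF_CGROUP_DEVICE, 0, NULL,
                       prog_ids, &prog_cnt) < 0) {
        perror("bpf_prog_query");
        close(cgfd);
        return 1;
    }

    if (prog_cnt > 0) {
        /* Re-acquire an FD for the program; the map could then be found
         * through the program's info (bpf_obj_get_info_by_fd). */
        progfd = bpf_prog_get_fd_by_id(prog_ids[0]);
        printf("attached device prog id %u, fd %d\n", prog_ids[0], progfd);
        if (progfd >= 0)
            close(progfd);
    }

    close(cgfd);
    return 0;
}

So if the cgroup is already gone, there is nothing to query and nothing to 
close, which is presumably why dropping the detection entirely is safe.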

> 
>     Now that I think about it we can most likely drop the detection
>     completely. That will fix the unnecessary error report when the
>     cgroup was already deleted by systemd.
> 
> 
> I'll test it on a non-systemd OS as well and post patches if you are OK
> with that.

That sounds great.
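
For reference, I'd guess the ENOENT handling you mentioned earlier would 
look something like the sketch below, though dropping the detection 
entirely makes it moot. Hypothetical function, not your patch:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
detect_prog_if_present(const char *cgroup_path)
{
    int dirfd = open(cgroup_path, O_RDONLY | O_DIRECTORY);

    if (dirfd < 0) {
        if (errno == ENOENT)
            return 0;        /* systemd already removed it: silently skip */
        perror(cgroup_path); /* anything else is still worth reporting */
        return -1;
    }

    /* ... real detection (BPF prog/map lookup) would happen here ... */
    close(dirfd);
    return 0;
}

int main(void)
{
    return detect_prog_if_present(
        "/sys/fs/cgroup/machine.slice/already-gone.scope/");
}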

> 
> Thanks for noticing the FD leak!

You're welcome. Thanks for taking the time to walk through this with me! 
Cheers,
Eric

> 
> Pavel
> 



