[libvirt] [PATCH 5/6] qemu: Leave cpuset.mems in parent cgroup alone
Martin Kletzander
mkletzan at redhat.com
Mon Dec 22 09:45:33 UTC 2014
On Wed, Dec 17, 2014 at 04:59:30PM -0700, Eric Blake wrote:
>On 12/17/2014 08:06 AM, Martin Kletzander wrote:
>> On Wed, Dec 17, 2014 at 12:00:36AM -0700, Eric Blake wrote:
>>> On 12/16/2014 11:51 PM, Eric Blake wrote:
>>>> On 12/15/2014 12:58 AM, Martin Kletzander wrote:
>>>>> Instead of setting the value of cpuset.mems once when the domain starts
>>>>> and then re-calculating the value every time we need to change the
>>>>> child cgroup values, leave the parent cgroup alone and instead set the
>>>>> child data every time a new child cgroup is created. We don't leave any
>>>>> task in the parent group anyway. This will ease both current and future
>>>>> code.
>>>>>
>>>>> Signed-off-by: Martin Kletzander <mkletzan at redhat.com>
>>>>> ---
>>>>> src/qemu/qemu_cgroup.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++--
>>>>> src/qemu/qemu_driver.c | 59 +++++++++++++++-----------------------------
>>>>> 2 files changed, 85 insertions(+), 41 deletions(-)
>>>>
>>>> This patch causes libvirtd to segfault on startup:
>>>
>>> More particularly, I'm doing an upgrade from older libvirtd to this
>>> version, while leaving a transient domain running that was started by
>>> the older libvirtd. Hope that helps you narrow in on the problem.
>
>Weird - At the time I made the report, I ran 'git bisect' and reliably
>reproduced the crash across multiple libvirtd restarts (a restart for
>each new build while trying to nail down the culprit commit), as long as
>I left that transient domain running.
>
>>>
>>
>> I tried that and it works for me. I tried various domains, both
>> heavily cgroup-dependent and simple ones.
>
>But now that I've rebooted, and then proceeded to do incremental builds
>from both before and after the patch, I can't reproduce the crash.
>Although my formula for creating my transient domain was the same both
>yesterday and today, there may be some difference in the version of
>libvirtd that was running when I created the domain; that version could
>have affected the domain XML, which in turn affected how the libvirtd
>restarts behaved.
>
>>
>>> Reverting 86759e and af2a1f0 was sufficient to get me going again
>>> locally, but I'm not going to push my reversions until you've first had
>>> a chance to address the regression.
>>>
>
>>>> #2 0x00007ffff7405673 in virCgroupHasEmptyTasks (cgroup=0x0, controller=2)
>>>>     at util/vircgroup.c:3935
>>
>> From this line it looks like priv->cgroup was not initialized. I did
>> not add a check for that, so that may be the cause. I'll send a patch
>> soon.
>>
>> But I wonder how you managed to do that; was it a session libvirtd
>> that you restarted? Otherwise, how come virCgroupNewDetectMachine()
>> didn't fill in the cgroup?
>
>I'm not spotting anything obvious why it wasn't initialized at the time
>I reproduced the crash, nor why I can't reproduce it now. Hopefully
>it's not a lurking time bomb.
>
>If it happens again, I'll definitely report it; but for now, without a
>reliable reproduction, it could easily have been something caused on my
>end (since it is my dev machine, it may have been caused by a half-baked
>patch on my end that I was testing at the time I created the transient
>domain, where using only upstream patches wouldn't see the issue). So
>for now, don't worry too hard if you can't find it either.
>
There is a valid code path that allows the cgroup to be NULL, although
it is not easy to reach. Since it is valid, we added a patch that
simply skips the cgroup restoration when cgroup == NULL.
If you stumble upon a similar problem again, I'd be happy to look
into it.
Martin