[libvirt-users] Could not destroy domain, current job is remoteDispatchConnectGetAllDomainStats

Vasiliy Tolstov v.tolstov at selfip.ru
Fri Jan 19 13:10:40 UTC 2018


2018-01-18 18:49 GMT+03:00 Michal Privoznik <mprivozn at redhat.com>:
> On 01/18/2018 08:25 AM, Ján Tomko wrote:
>> On Wed, Jan 17, 2018 at 04:45:38PM +0200, Serhii Kharchenko wrote:
>>> Hello libvirt-users list,
>>>
>>> We have been hitting the same bug since version 3.4.0 (3.3.0 works OK).
>>> We have a process that is permanently connected to libvirtd via a socket;
>>> it collects stats, listens to events and controls the VPSes.
>>>
>>> When we try to 'shutdown' a number of VPSes we often hit the bug. One of
>>> the VPSes gets stuck in the 'in shutdown' state, no related 'qemu' process
>>> is present, and the following errors appear in the log:
>>>
>>> Jan 17 13:54:20 server1 libvirtd[20437]: 2018-01-17 13:54:20.005+0000: 20438: warning : qemuGetProcessInfo:1460 : cannot parse process status data
>>> Jan 17 13:54:20 server1 libvirtd[20437]: 2018-01-17 13:54:20.006+0000: 20441: error : virFileReadAll:1420 : Failed to open file '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d36\x2dDOMAIN1.scope/cpuacct.usage': No such file or directory
>>> Jan 17 13:54:20 server1 libvirtd[20437]: 2018-01-17 13:54:20.006+0000: 20441: error : virCgroupGetValueStr:844 : Unable to read from '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d36\x2dDOMAIN1.scope/cpuacct.usage': No such file or directory
>>> Jan 17 13:54:20 server1 libvirtd[20437]: 2018-01-17 13:54:20.006+0000: 20441: error : virCgroupGetDomainTotalCpuStats:3319 : unable to get cpu account: Operation not permitted
>>> Jan 17 13:54:23 server1 libvirtd[20437]: 2018-01-17 13:54:23.805+0000: 20522: warning : qemuDomainObjBeginJobInternal:4862 : Cannot start job (destroy, none) for domain DOMAIN1; current job is (query, none) owned by (20440 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (30s, 0s)
>>> Jan 17 13:54:23 server1 libvirtd[20437]: 2018-01-17 13:54:23.805+0000: 20522: error : qemuDomainObjBeginJobInternal:4874 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
>>>
>>> I think only the last line matters.
>>> The bug is highly reproducible. We can easily trigger it even by running
>>> several 'virsh shutdown' commands one after another in a shell.
>>>
>>> When we shut down the process connected to the socket, everything
>>> becomes OK and the bug is gone.
>>>
>>> The system used is Gentoo Linux. We tried all recent versions of libvirt
>>> (3.4.0, 3.7.0, 3.8.0, 3.9.0, 3.10.0, 4.0.0-rc2 (today's version from git))
>>> and they all have this bug. 3.3.0 works OK.
>>>
>>
>> I don't see anything obviously stats-related in the diff between 3.3.0 and
>> 3.4.0. We have added reporting of the shutdown reason, but that's just
>> parsing one more JSON reply we previously ignored.
>>
>> Can you try running 'git bisect' to pinpoint the exact commit that
>> caused this issue?
>
> I am able to reproduce this issue, ran bisect and found that the commit
> which broke it is aeda1b8c56dc58b0a413acc61bbea938b40499e1.
>
> https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=aeda1b8c56dc58b0a413acc61bbea938b40499e1;hp=ec337aee9b20091d6f9f60b78f210d55f812500b
>
> But it's very unlikely that the commit is causing the error. If anything
> it is just exposing whatever error we have there. I mean, if I revert
> the commit on the top of current HEAD I can no longer reproduce the issue.
>

Yes, some time ago I already reported this on the list, and someone from
the Virtuozzo/OpenVZ team helped me revert exactly this commit.
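
For anyone who wants to spot affected domains from the logs, here is a
minimal sketch that pulls the domain name and the job owner out of the
qemuDomainObjBeginJobInternal warning quoted above. It assumes the message
has exactly the wording shown in this thread; other libvirt versions may
phrase it differently. The names StuckJob and parse_job_warning are made up
for this example and are not part of libvirt.

```python
import re
from typing import NamedTuple, Optional

# Pattern modeled on the "Cannot start job" warning quoted in this thread.
# Assumption: the wording matches that log line; it is not a stable format.
JOB_WARNING = re.compile(
    r"Cannot start job \((?P<wanted>\w+), \w+\) for domain (?P<domain>[^;]+); "
    r"current job is \((?P<current>\w+), \w+\) owned by "
    r"\((?P<thread>\d+) (?P<owner>\w+), .*?\) for \((?P<held>\d+)s, \d+s\)"
)


class StuckJob(NamedTuple):
    """Hypothetical record describing one blocked job request."""
    domain: str
    wanted_job: str
    current_job: str
    owner_thread: int
    owner_api: str
    held_seconds: int


def parse_job_warning(line: str) -> Optional[StuckJob]:
    """Return job details from one warning line, or None if it does not match."""
    # Collapse whitespace so entries wrapped by a mail client still match.
    normalized = " ".join(line.split())
    match = JOB_WARNING.search(normalized)
    if match is None:
        return None
    return StuckJob(
        domain=match["domain"],
        wanted_job=match["wanted"],
        current_job=match["current"],
        owner_thread=int(match["thread"]),
        owner_api=match["owner"],
        held_seconds=int(match["held"]),
    )
```

Feeding it each line of the libvirtd log and collecting the non-None results
gives a list of domains whose 'destroy' job is blocked, together with the API
call that holds the job.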



-- 
Vasiliy Tolstov,
e-mail: v.tolstov at selfip.ru



