[libvirt] [openstack-dev] [nova] The risk of hanging when shutdown instance.

Fri Apr 10 14:04:34 UTC 2015

On 03/30/2015 11:37 PM, zhang bo wrote:
> On 2015/3/31 4:36, Eric Blake wrote:
>
>> On 03/30/2015 06:08 AM, Michal Privoznik wrote:
>>> On 30.03.2015 11:28, zhang bo wrote:
>>>> On 2015/3/28 18:06, Rui Chen wrote:
>>>>
>>>>> <snip/>
>>>>   The description of the virDomainShutdown API is out of date and no longer correct.
>>>>   In fact, whether virDomainShutdown blocks depends on its mode. In *agent* mode, it blocks until qemu finds that the guest has actually shut down.
>>>> In *acpi* mode, it returns immediately.
>>>>   Thus, further work may need to be done in OpenStack.
>>>>
>>>>   What are your opinions, Michal and Daniel (from libvirt.org), and Chris (from openstack.org)? :)
>>>>
>>>
>>> Yep, the documentation could be better in that respect. I've proposed a
>>> patch on the libvirt upstream list:
>>>
>>> https://www.redhat.com/archives/libvir-list/2015-March/msg01533.html
>> I don't think a doc patch is right.  If you don't pass any flags, then
>> it is up to the hypervisor which method it will attempt (agent or ACPI).
>>  Yes, explicitly requesting an agent as the only method to attempt might
>> be justifiable as a reason to block, but the overall API contract is to
>> NOT block indefinitely.  I think that rather than a doc patch, we need
>> to fix the underlying bug, and guarantee that we return after a finite
>> time even when the agent is involved.
>>
> So, can we reach a final decision? :) Should we time out in virDomainShutdown() or leave it to OpenStack?
> The two solutions I can see are:
> 1) In libvirt: time out in virDomainShutdown() and virDomainReboot().
> 2) In OpenStack: spawn a new thread to monitor the guest's status; if it has not shut off a while after dom.shutdown(),
>    call dom.destroy() to force it down.

And to complicate things a bit further, when you call dom.destroy() you
should probably first call it with the VIR_DOMAIN_DESTROY_GRACEFUL flag
(which will send SIGTERM to qemu, but never SIGKILL), so that qemu gets
a chance to flush the disk image files to disk. Then, if that fails,
call dom.destroy() again without that flag (which sends SIGTERM to qemu,
and if qemu still hasn't exited after a while, sends SIGKILL).
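As a rough illustration of option 2 combined with the two-step destroy
described above, here is a minimal Python sketch. It is not taken from nova
or libvirt. The helper names (stop_domain and friends), the timeouts, and
the return strings are hypothetical. It assumes the libvirt-python domain
methods shutdown(), isActive(), destroyFlags() and destroy(), assumes a
persistent domain (so isActive() can still be queried after the guest
stops), and assumes that a graceful destroy that fails raises an exception.
The flag value is hard-coded so the sketch does not need libvirt installed.

```python
import threading
import time

# Value of VIR_DOMAIN_DESTROY_GRACEFUL in libvirt's public headers (1 << 0).
# With libvirt-python installed, libvirt.VIR_DOMAIN_DESTROY_GRACEFUL is the
# same constant; it is repeated here only to keep the sketch self-contained.
VIR_DOMAIN_DESTROY_GRACEFUL = 1


def _request_shutdown(dom):
    """Ask the guest to shut down; errors simply lead to escalation later."""
    try:
        dom.shutdown()
    except Exception:
        pass


def _wait_until_inactive(dom, timeout, poll_interval):
    """Poll dom.isActive() until the domain stops or the timeout expires."""
    deadline = time.monotonic() + timeout
    while dom.isActive():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


def stop_domain(dom, shutdown_timeout=60.0, poll_interval=1.0):
    """Stop a domain, escalating step by step; return the step that worked."""
    # Run shutdown() in a worker thread so that a call which blocks
    # (for example in agent mode) cannot hang the caller.
    threading.Thread(target=_request_shutdown, args=(dom,), daemon=True).start()
    if _wait_until_inactive(dom, shutdown_timeout, poll_interval):
        return "shutdown"

    # First escalation: SIGTERM only, so qemu can flush its disk images.
    try:
        dom.destroyFlags(VIR_DOMAIN_DESTROY_GRACEFUL)
    except Exception:
        pass  # assumed: libvirt reports an error if qemu ignores SIGTERM
    if not dom.isActive():
        return "destroy-graceful"

    # Last resort: plain destroy, which may end in SIGKILL.
    dom.destroy()
    return "destroy-forced"
```

With libvirt-python, dom would typically come from something like
conn.lookupByName() on an open connection; any object exposing the same
four methods works with this sketch.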

For a better explanation, here's the email for the patch that added
DESTROY_GRACEFUL:

  http://www.redhat.com/archives/libvir-list/2012-February/msg00124.html