[libvirt] [PATCH] qemu: Qemu process unexpectedly killed in repeated reboot

Wangjing (King) king.wang at huawei.com
Tue Dec 4 06:54:01 UTC 2018


On 11/30/18 9:53 AM, Wang King wrote:
>> The issue occurs when I make repeated calls to virDomainReboot with 
>> VIR_DOMAIN_REBOOT_DEFAULT flag. In the first call to reboot domain, 
>> the qemu driver chose ACPI path, and set priv->fakeReboot to true.
>> Then in a second call, qemu driver chose agent to reboot which set 
>> fakeReboot to false. But because the guest already responded to ACPI 
>> shut down, libvirtd daemon will process a SHUTDOWN event in 
>> qemuProcessShutdownOrReboot and checks priv->fakeReboot. Since the 
>> fakeReboot flag is now false, qemu process is unexpectedly killed.
>> 
>
>This sounds fishy. Looking at the code libvirt decides whether to use agent or ACPI based on:
>
>a) flags (but since you're passing 0 this is out of the picture),
>b) guest agent being available,
>
>This means that agent must have connected between two virDomainReboot() calls. Otherwise libvirt would make the same choice.
>
>> I have no idea how to fix it.
>
>Well, the qemuDomainSetFakeReboot(false) call was added in b0c144c5792 which points to:
>
>  https://www.redhat.com/archives/libvir-list/2015-April/msg00732.html
>
>I think the patch proposed there is actually right and not the one that was merged.
>
>Michal

The problem:

Consider this scenario: first call virDomainReboot() in ACPI mode on a domain (hereinafter referred to as DOM1), then immediately call virDomainReboot() again in agent mode. This sequence is supposed to reboot DOM1. Instead, the current libvirt code *KILLS* the QEMU process associated with the domain.

How to reproduce this problem:

Simply make an ACPI-mode reboot call to the domain and immediately follow it with an agent-mode reboot call. You will see that the domain is killed straight away.
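
A minimal sketch of the two calls, for illustration only. The flag values are those of VIR_DOMAIN_REBOOT_ACPI_POWER_BTN and VIR_DOMAIN_REBOOT_GUEST_AGENT from the public libvirt API; the helper name reproduce() is hypothetical, and `dom` is assumed to be any object with a reboot(flags) method, such as a virDomain from the libvirt-python bindings.

```python
# Values of the public virDomainRebootFlagValues enum.
VIR_DOMAIN_REBOOT_ACPI_POWER_BTN = 1 << 0
VIR_DOMAIN_REBOOT_GUEST_AGENT = 1 << 1


def reproduce(dom):
    """Issue an ACPI-mode reboot immediately followed by an agent-mode reboot.

    `dom` is assumed to expose reboot(flags), as libvirt.virDomain does.
    (reproduce is a hypothetical helper, not part of libvirt.)
    """
    # First call: ACPI path, which sets fakeReboot to true in the qemu driver.
    dom.reboot(VIR_DOMAIN_REBOOT_ACPI_POWER_BTN)
    # Second call, with no delay: agent path, which currently clears fakeReboot.
    dom.reboot(VIR_DOMAIN_REBOOT_GUEST_AGENT)
```

With the Python bindings installed, `dom` could be obtained with something like libvirt.open("qemu:///system").lookupByName("DOM1"), where both the URI and the domain name are placeholders. The guest agent must be connected for the second call to take the agent path.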

Root cause:

The unexpected SIGKILL happens because libvirt sees the fakeReboot flag set to *FALSE* when it receives the SHUTDOWN event triggered by the ACPI-mode virDomainReboot(). While DOM1 is already rebooting, its guest agent is still available, so the second, agent-mode virDomainReboot() passes the availability check and sets the fakeReboot flag to *FALSE* (its “guest-shutdown” command is also sent, but that is outside the scope of this email). When DOM1 then shuts down as part of the reboot process, its QEMU process sends a SHUTDOWN event to libvirt. Libvirt checks the fakeReboot flag, sees that it is *FALSE*, and forcefully kills the QEMU process.
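
The ordering can be replayed with a toy model. This is only a sketch of the bookkeeping as described in this email, not libvirt code: the class DomainModel and its method names are made up for illustration, and the only state modelled is the fakeReboot flag and whether the QEMU process is alive.

```python
class DomainModel:
    """Toy model of the fakeReboot bookkeeping (hypothetical, not libvirt code)."""

    def __init__(self):
        self.fake_reboot = False
        self.qemu_alive = True
        self.log = []

    def reboot_acpi(self):
        # ACPI path: the guest is asked to power down, and the flag records
        # that the resulting SHUTDOWN event must be turned into a reset.
        self.fake_reboot = True
        self.log.append("ACPI reboot: fakeReboot = true")

    def reboot_agent(self):
        # Agent path as it behaves since b0c144c5792: the flag is cleared.
        self.fake_reboot = False
        self.log.append("agent reboot: fakeReboot = false")

    def on_shutdown_event(self):
        # Stand-in for the decision made in qemuProcessShutdownOrReboot.
        if self.fake_reboot:
            self.log.append("SHUTDOWN event: reset the guest")
        else:
            self.qemu_alive = False
            self.log.append("SHUTDOWN event: kill the QEMU process")


dom1 = DomainModel()
dom1.reboot_acpi()        # first virDomainReboot(), ACPI mode
dom1.reboot_agent()       # second virDomainReboot(), agent mode
dom1.on_shutdown_event()  # guest finally reacts to the ACPI request
print("\n".join(dom1.log))
print("QEMU alive:", dom1.qemu_alive)
```

Dropping the reboot_agent() call from the sequence leaves the flag set, and the model resets the guest instead of killing QEMU.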


Analysis:

Say that we make two calls to shut down or reboot DOM1. DOM1 will eventually respond to only one of the two calls. For the other call, only two scenarios are possible: DOM1 accepts the second call but does not execute it (meaning only the fakeReboot flag is modified on the libvirt side), or DOM1 accepts the second call and it overrides the first one. The second scenario is the same as the first with the roles reversed (DOM1 responds to the first call and the flag is set by the second call, versus DOM1 responds to the second call and the flag is set by the first call). We can therefore simplify the problem: DOM1 always responds to the first call, and the second call is also accepted but only serves to change the fakeReboot flag. We then compare what we expect to happen with what actually happens. Questionable results are marked with tildes (~) and the faulty result is marked with asterisks (*).

The chart below shows how the domain should react to shutdown/reboot calls issued in different orders:

---------------------------------------------------------------------------------------------------
| First Command  | Second Command (sets flag only) | Expected Result | Actual Result | After Fix  |
---------------------------------------------------------------------------------------------------
| ACPI SHUTDOWN  | ACPI SHUTDOWN                   | SHUTDOWN        | SHUTDOWN      | SHUTDOWN   |
| ACPI REBOOT    | ACPI SHUTDOWN                   | REBOOT          | ~SHUTDOWN~    | ~SHUTDOWN~ |
| AGENT SHUTDOWN | ACPI SHUTDOWN                   | SHUTDOWN        | SHUTDOWN      | SHUTDOWN   |
| AGENT REBOOT   | ACPI SHUTDOWN                   | REBOOT          | REBOOT        | REBOOT     |
| ACPI SHUTDOWN  | ACPI REBOOT                     | SHUTDOWN        | ~REBOOT~      | ~REBOOT~   |
| ACPI REBOOT    | ACPI REBOOT                     | REBOOT          | REBOOT        | REBOOT     |
| AGENT SHUTDOWN | ACPI REBOOT                     | SHUTDOWN        | ~REBOOT~      | ~REBOOT~   |
| AGENT REBOOT   | ACPI REBOOT                     | REBOOT          | REBOOT        | REBOOT     |
| ACPI SHUTDOWN  | AGENT SHUTDOWN                  | SHUTDOWN        | SHUTDOWN      | SHUTDOWN   |
| ACPI REBOOT    | AGENT SHUTDOWN                  | REBOOT          | ~SHUTDOWN~    | ~SHUTDOWN~ |
| AGENT SHUTDOWN | AGENT SHUTDOWN                  | SHUTDOWN        | SHUTDOWN      | SHUTDOWN   |
| AGENT REBOOT   | AGENT SHUTDOWN                  | REBOOT          | REBOOT        | REBOOT     |
| ACPI SHUTDOWN  | AGENT REBOOT                    | SHUTDOWN        | SHUTDOWN      | SHUTDOWN   |
| ACPI REBOOT    | AGENT REBOOT                    | REBOOT          | **SHUTDOWN**  | **REBOOT** |
| AGENT SHUTDOWN | AGENT REBOOT                    | SHUTDOWN        | SHUTDOWN      | SHUTDOWN   |
| AGENT REBOOT   | AGENT REBOOT                    | REBOOT          | REBOOT        | REBOOT     |
---------------------------------------------------------------------------------------------------
As you can see, when DOM1 is rebooted via ACPI and immediately receives an agent-mode reboot, it will *SHUTDOWN* instead. (There are certainly questionable results in other combinations, but we think they cause less confusion in real-life situations, so we are not discussing them here.)
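
For reference, the "Actual Result" and "After Fix" columns can be derived from a small model. This is a sketch under the simplification used in the analysis (DOM1 acts on the first call, the second call only updates the flag) plus one further assumption of mine: a guest rebooted through the agent restarts by itself, so libvirt never has to decide on a SHUTDOWN event in that case. The function names flag_after() and outcome() are hypothetical.

```python
from itertools import product

COMMANDS = list(product(("ACPI", "AGENT"), ("SHUTDOWN", "REBOOT")))


def flag_after(flag, mode, action, fixed):
    """Value of fakeReboot after one shutdown/reboot call (modelled)."""
    if mode == "ACPI":
        return action == "REBOOT"
    if action == "REBOOT" and fixed:
        return flag   # with the revert, an agent reboot leaves the flag alone
    return False      # agent shutdown, or agent reboot before the revert


def outcome(first, second, fixed=False):
    """Modelled result when DOM1 acts on `first` and `second` only sets the flag."""
    flag = flag_after(False, *first, fixed)
    flag = flag_after(flag, *second, fixed)
    if first == ("AGENT", "REBOOT"):
        return "REBOOT"   # assumption: the guest restarts without a SHUTDOWN event
    return "REBOOT" if flag else "SHUTDOWN"


# Same row order as the chart: second command in the outer loop.
for second in COMMANDS:
    for first in COMMANDS:
        print(" ".join(first), "|", " ".join(second), "|",
              outcome(first, second), "|", outcome(first, second, fixed=True))
```

In this model the revert changes exactly one combination, ACPI REBOOT followed by AGENT REBOOT; the rows marked with tildes stay as they are.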

Fix:
The fix is to revert the change introduced by commit b0c144c5792, which made the agent-mode reboot set fakeReboot to false. This should, at least, avoid killing the QEMU process during successive reboot calls. Nonetheless, the shutdown/reboot handling still needs further work, as the analysis above shows. What do you think about this? Should we revert the commit first, or look for a solution that fixes all of these cases?



