[libvirt] [PATCHv2 2/2] qemu: increase the timeout before sending SIGKILL to qemu process

Daniel Veillard veillard at redhat.com
Fri Feb 3 08:24:35 UTC 2012


On Thu, Feb 02, 2012 at 12:54:29PM -0500, Laine Stump wrote:
> The current default method of terminating the qemu process is to send
> a SIGTERM, wait for up to 1.6 seconds for it to shut down cleanly, then
> send a SIGKILL and wait for up to 1.4 seconds more for the process to
> terminate. This is problematic because occasionally 1.6 seconds is not
> long enough for the qemu process to flush its disk buffers, so the
> guest's disk ends up in an inconsistent state.
> 
> Although a previous patch has provided a new flag to allow
> applications to alter this behavior, it will take time for
> applications to be updated to use this new flag, and since the fault
> for this inconsistent state lies squarely with libvirt, libvirt should
> be proactive about preventing the situation even before the
> applications can be updated.
> 
> Since this only occasionally happens when the timeout prior to SIGKILL
> is 1.6 seconds, this patch increases that timeout to 10 seconds. At
> the very least, this should reduce the occurrence from "occasionally"
> to "extremely rarely". (Once SIGKILL is sent, it waits another 5
> seconds for the process to die before returning).
> 
> Note that in the cases where it takes less than this for qemu to
> shut down cleanly, libvirt will *not* wait for any longer than it would
> without this patch - qemuProcessKill polls the process and returns as
> soon as it is gone.
> ---
>  src/qemu/qemu_process.c |   18 ++++++++++--------
>  1 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index 1bbb55c..5044d76 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -3547,14 +3547,16 @@ int qemuProcessKill(virDomainObjPtr vm, unsigned int flags)
>  
>      /* This loop sends SIGTERM (or SIGKILL if flags has
>       * VIR_QEMU_PROCESS_KILL_FORCE and VIR_QEMU_PROCESS_KILL_NOWAIT),
> -     * then waits a few iterations (3 seconds) to see if it
> -     * dies. Halfway through this wait, if the qemu process still
> -     * hasn't exited, and VIR_QEMU_PROCESS_KILL_FORCE is requested, a
> -     * SIGKILL will be sent.  Note that the FORCE mode could result
> -     * in lost data in the guest, so it should only be used if the
> -     * guest is hung and can't be destroyed in any other manner.
> +     * then waits a few iterations (10 seconds) to see if it dies. If
> +     * the qemu process still hasn't exited, and
> +     * VIR_QEMU_PROCESS_KILL_FORCE is requested, a SIGKILL will then
> +     * be sent, and qemuProcessKill will wait up to 5 seconds more for
> +     * the process to exit before returning.  Note that the FORCE mode
> +     * could result in lost data in the guest, so it should only be
> +     * used if the guest is hung and can't be destroyed in any other
> +     * manner.
>       */
> -    for (i = 0 ; i < 15; i++) {
> +    for (i = 0 ; i < 75; i++) {
>          int signum;
>          if (i == 0) {
>              if ((flags & VIR_QEMU_PROCESS_KILL_FORCE) &&
> @@ -3564,7 +3566,7 @@ int qemuProcessKill(virDomainObjPtr vm, unsigned int flags)
>              } else {
>                  signum = SIGTERM; /* kindly suggest it should exit */
>              }
> -        } else if ((i == 8) & (flags & VIR_QEMU_PROCESS_KILL_FORCE)) {
> +        } else if ((i == 50) & (flags & VIR_QEMU_PROCESS_KILL_FORCE)) {
>              VIR_WARN("Timed out waiting after SIG%s to process %d, "
>                       "sending SIGKILL", signame, vm->pid);
>              signum = SIGKILL; /* nuke it after a grace period */

  On the semantics of the patch: it does what it says, so ACK to this.
But that's unfortunately a pure heuristic. When the domain stops
gracefully there is no problem and this doesn't change anything. If
the domain is doing intensive I/O, flushing buffers for example, the
extra grace period may help, but there is absolutely no guarantee. On
Linux we could try to be a bit smart and detect completely stuck guests
by looking at rchar and wchar in /proc/$pid/io: if they don't move at
all across the iterations we can probably consider the guest dead; if
they do, well, we can be pretty sure that SIGKILL will lose data :-\
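
  To make the /proc/$pid/io idea concrete, here is a rough sketch,
assuming Linux and a caller allowed to read that file. The struct and
the three function names are hypothetical, made up for illustration;
none of this is existing libvirt code.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper type: the two counters of interest from
 * /proc/<pid>/io. */
struct io_counters {
    unsigned long long rchar;
    unsigned long long wchar;
};

/* Parse the "rchar:" and "wchar:" lines out of the text of
 * /proc/<pid>/io. Returns 0 on success, -1 if a field is missing. */
int parse_io_counters(const char *text, struct io_counters *out)
{
    const char *r = strstr(text, "rchar:");
    const char *w = strstr(text, "wchar:");

    if (!r || !w)
        return -1;
    if (sscanf(r, "rchar: %llu", &out->rchar) != 1 ||
        sscanf(w, "wchar: %llu", &out->wchar) != 1)
        return -1;
    return 0;
}

/* Read and parse /proc/<pid>/io. Returns 0 on success, -1 on error
 * (process gone, no permission, unexpected format). */
int read_io_counters(pid_t pid, struct io_counters *out)
{
    char path[64];
    char buf[512];
    size_t n;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%ld/io", (long)pid);
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return parse_io_counters(buf, out);
}

/* Nonzero if neither counter moved between two samples. */
int io_is_stalled(const struct io_counters *prev,
                  const struct io_counters *cur)
{
    return prev->rchar == cur->rchar && prev->wchar == cur->wchar;
}
```

  The wait loop could take one sample per iteration and compare it with
the previous one before deciding whether to escalate to SIGKILL. Treating
unchanged counters as "stuck" is itself only an assumption: rchar and
wchar count bytes passed to read- and write-style system calls, so they
do not say whether data has actually reached the disk.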

  ACK to this heuristic attempt, but maybe a smarter algorithm is
in order. I'm sure others will comment :-)

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel at veillard.com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/



