[PATCH 3/3] qemu: sync backing chain update in virDomainGetBlockJobInfo

Thu Oct 22 12:25:03 UTC 2020

On Tue, Oct 20, 2020 at 16:44:09 +0300, Nikolay Shirokovskiy wrote:
> Some mgmt still use polling for block job completion. After job completion the
> job failure/success is infered by inspecting domain xml. With legacy block job
> processing this does not always work.

So, fix your app? :)

> The issue deals with how libvirt processes events. If no other thread is
> waiting for blockjob event then event processing if offloaded to worker thread.
> If now virDomainGetBlockJobInfo API is called then as block job is already
> dismissed in legacy scheme the API returns 0 but backing chain is not yet
> updated as processing yet be done in worker thread. Now mgmt checks backing
> chain right after return from the API call and detects error.
> 
> This happens quite often under load. I guess because of we have only one worker
> thread for all the domains.

So basically the problem is that qemuDomainGetBlockJobInfo returns
nothing, but the job is still stuck.

> The event delivery is synchronous in qemu and block job completed event is sent
> in job finalize step so if block job is absent the event is already delivered
> and we just need to process it.
> 
> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy at virtuozzo.com>
> ---
>  src/qemu/qemu_driver.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index c06db3a..7189a29 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -14672,8 +14672,15 @@ qemuDomainGetBlockJobInfo(virDomainPtr dom,
>      ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name, &rawInfo);
>      if (qemuDomainObjExitMonitor(driver, vm) < 0)
>          ret = -1;
> -    if (ret <= 0)
> +    if (ret < 0)
> +        goto endjob;
> +    if (ret == 0) {
> +        qemuDomainObjPrivatePtr priv = vm->privateData;
> +
> +        if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_BLOCKDEV))
> +            qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
>          goto endjob;
> +    }
>  
>      if (qemuBlockJobInfoTranslate(&rawInfo, info, disk,
>                                    flags & VIR_DOMAIN_BLOCK_JOB_INFO_BANDWIDTH_BYTES) < 0) {

My alternate proposal is to not return 0 if qemuMonitorGetBlockJobInfo
returns 0 jobs, but rather continue to qemuBlockJobInfoTranslate which
will report fake data for the job.

That will prevent the ugly changes to the handling code and will pretend
that the job still isn't complete for broken apps.