[libvirt] [PATCH 1/1] qemu: sync blockjob finishing in qemuDomainGetBlockJobInfo

Nikolay Shirokovskiy nshirokovskiy at virtuozzo.com
Thu Nov 28 07:29:08 UTC 2019



On 27.11.2019 17:56, Peter Krempa wrote:
> On Wed, Nov 27, 2019 at 17:19:18 +0300, Nikolay Shirokovskiy wrote:
>> Due to a race, qemuDomainGetBlockJobInfo can report that there is
>> no block job for a disk while a later attempt to spawn a new blockjob
>> still fails, because libvirt has not yet processed the blockjob
>> finishing internally. Thus let's wait for the blockjob to finish
>> if we are about to report there is no more blockjob.
> 
> Could you please elaborate how this happened? e.g. provide some logs?
> 
> Polling with qemuDomainGetBlockJobInfo will always be racy, and if the
> problem is that diskPriv->blockjob is allocated but qemu didn't report
> anything, I suspect the problem is somewhere else.
> 
> 
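To illustrate the sequence the commit message describes, here is a
hypothetical management-application loop (the disk name, path and helper
name are made up for the example; this is a sketch, not code from an
actual reproducer):

#include <unistd.h>
#include <libvirt/libvirt.h>

/* Poll until the block job on "vda" disappears, then immediately start
 * a new one.  The second step can fail even though GetBlockJobInfo just
 * returned 0, because libvirtd has not processed the job completion
 * event yet and still tracks a blockjob for the disk internally. */
static int
restart_copy(virDomainPtr dom)
{
    virDomainBlockJobInfo info;
    int rc;

    while ((rc = virDomainGetBlockJobInfo(dom, "vda", &info, 0)) == 1)
        usleep(100 * 1000);

    if (rc < 0)
        return -1;

    /* rc == 0: no job reported, so starting a new one "should" work */
    return virDomainBlockRebase(dom, "vda", "/path/to/copy-dest.qcow2", 0,
                                VIR_DOMAIN_BLOCK_REBASE_COPY);
}
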
>>
>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy at virtuozzo.com>
>> ---
>>  src/qemu/qemu_driver.c | 18 +++++++++++++++++-
>>  1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
>> index 669c12d6ca..b148df3a57 100644
>> --- a/src/qemu/qemu_driver.c
>> +++ b/src/qemu/qemu_driver.c
>> @@ -17785,12 +17785,26 @@ qemuDomainGetBlockJobInfo(virDomainPtr dom,
>>          goto endjob;
>>      }
>>  
>> +    qemuBlockJobSyncBegin(job);
>> +
>>      qemuDomainObjEnterMonitor(driver, vm);
>>      ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name, &rawInfo);
>>      if (qemuDomainObjExitMonitor(driver, vm) < 0)
>>          ret = -1;
>> -    if (ret <= 0)
>> +    if (ret < 0)
>> +        goto endjob;
>> +
>> +    if (ret == 0) {
> 
> So if qemu returns that there is no blockjob ...
> 
>> +        qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
>> +        while (qemuBlockJobIsRunning(job)) {
> 
> but we think it's still running, one possibility is that the event
> arrived but wasn't processed yet. In that case this would help, but
> the outcome would not differ much from the case without this code,
> since the event would be processed shortly afterwards anyway.
> 
> 
>> +            if (virDomainObjWait(vm) < 0) {
>> +                ret = -1;
>> +                goto endjob;
>> +            }
>> +            qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
> 
> The other possibility is a bug: libvirt thinks there's a job while
> qemu (correctly) doesn't know about one. In that case this will lock
> up here, as the event will never arrive.
> 
>> +        }
>>          goto endjob;
> 
> My suggested solution would be to fake the job from the data in 'job'
> rather than attempting to wait in this case, e.g.
> 
> if (ret == 0) {
>     rawStats.type = job->type;
>     rawStats.ready = 0;
> }
> 
> and then let it proceed to qemuBlockJobInfoTranslate, which sets some
> fake progress data so that the job looks active.
> 
> That way you get a response that there's a job without the potential to
> deadlock.
> 
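
If I read the suggestion right, it would look roughly like this in
qemuDomainGetBlockJobInfo (a sketch only, reusing the existing 'rawInfo'
and 'job' variables from the function; not tested):

    if (ret == 0 && qemuBlockJobIsRunning(job)) {
        /* qemu no longer reports the job but we have not processed the
         * completion event yet: report a fake active job built from the
         * data we still track in 'job' instead of returning "no job" */
        rawInfo.type = job->type;
        rawInfo.ready = 0;
        ret = 1;
    }

    /* then fall through to qemuBlockJobInfoTranslate() as before */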

Fake progress data does not look nice. Maybe we should cache the progress on qemuDomainGetBlockJobInfo
calls so that it does not go backwards after the patch?
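
Something along these lines is what I mean (sketch only; 'lastCur' and
'lastEnd' are hypothetical fields that would have to be added to
qemuBlockJobData):

    if (ret == 1) {
        /* remember the last progress we reported to the caller */
        job->lastCur = rawInfo.cur;
        job->lastEnd = rawInfo.end;
    } else if (ret == 0 && qemuBlockJobIsRunning(job)) {
        /* qemu already finished the job but the completion event is not
         * processed yet: report the cached progress instead of fake
         * values so the numbers never go backwards */
        rawInfo.type = job->type;
        rawInfo.cur = job->lastCur;
        rawInfo.end = job->lastEnd;
        ret = 1;
    }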

I can understand protecting against possible bugs when recovering is not possible or difficult,
like in the case of data corruption. But in this case it is just a matter of restarting libvirtd, I guess.

Nikolay
