[PATCH for-6.1? v2 5/7] job: Add job_cancel_requested()

Wed Aug 4 14:15:03 UTC 2021

On 04.08.21 12:34, Kevin Wolf wrote:
> [ Peter, the question for you is at the end. ]
>
> Am 04.08.2021 um 10:07 hat Max Reitz geschrieben:
>> On 03.08.21 16:25, Kevin Wolf wrote:
>>> Am 26.07.2021 um 16:46 hat Max Reitz geschrieben:
>>>> Most callers of job_is_cancelled() actually want to know whether the job
>>>> is on its way to immediate termination.  For example, we refuse to pause
>>>> jobs that are cancelled; but this only makes sense for jobs that are
>>>> really actually cancelled.
>>>>
>>>> A mirror job that is cancelled during READY with force=false should
>>>> absolutely be allowed to pause.  This "cancellation" (which is actually
>>>> a kind of completion) may take an indefinite amount of time, and so
>>>> should behave like any job during normal operation.  For example, with
>>>> on-target-error=stop, the job should stop on write errors.  (In
>>>> contrast, force-cancelled jobs should not get write errors, as they
>>>> should just terminate and not do further I/O.)
>>>>
>>>> Therefore, redefine job_is_cancelled() to only return true for jobs that
>>>> are force-cancelled (which as of HEAD^ means any job that interprets the
>>>> cancellation request as a request for immediate termination), and add
>>>> job_cancel_request() as the general variant, which returns true for any
>>>> jobs which have been requested to be cancelled, whether it be
>>>> immediately or after an arbitrarily long completion phase.
>>>>
>>>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
>>>> Signed-off-by: Max Reitz <mreitz at redhat.com>
>>>> ---
>>>>    include/qemu/job.h |  8 +++++++-
>>>>    block/mirror.c     | 10 ++++------
>>>>    job.c              |  7 ++++++-
>>>>    3 files changed, 17 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/include/qemu/job.h b/include/qemu/job.h
>>>> index 8aa90f7395..032edf3c5f 100644
>>>> --- a/include/qemu/job.h
>>>> +++ b/include/qemu/job.h
>>>> @@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
>>>>    /** Returns true if the job should not be visible to the management layer. */
>>>>    bool job_is_internal(Job *job);
>>>> -/** Returns whether the job is scheduled for cancellation. */
>>>> +/** Returns whether the job is being cancelled. */
>>>>    bool job_is_cancelled(Job *job);
>>>> +/**
>>>> + * Returns whether the job is scheduled for cancellation (at an
>>>> + * indefinite point).
>>>> + */
>>>> +bool job_cancel_requested(Job *job);
>>> I don't think non-force blockdev-cancel for mirror should actually be
>>> considered cancellation, so what is the question that this function
>>> answers?
>>>
>>> "Is this a cancelled job, or a mirror block job that is supposed to
>>> complete soon, but only if it doesn't switch over the users to the
>>> target on completion"?
>> Well, technically yes, but it was more intended as “Has the user ever
>> invoked (block-)job-cancel on this job?”.
> I understand this, but is this much more useful to know than "Has the
> user ever called HMP 'change'?", if you know what I mean?

Hm.  Not really.  It’s still a crutch that shouldn’t be there ideally.

But I like this crutch for this series so I can get this batch done, and 
then worry about all the other bugs that keep popping up (and where 
job_cancel_requested() is a nice sign that something’s off).

>>> Is this ever a reasonable question to ask, except maybe inside the
>>> mirror implementation itself?
>> I asked myself the same for v3, but found two places in job.c where I
>> would like to keep it:
>>
>> First, there’s an assertion in job_completed_txn_abort().  All jobs
>> other than @job have been force-cancelled, and so job_is_cancelled()
>> would be true for them.  As for @job itself, the function is mostly
>> called when the job’s return value is not 0, but a soft-cancelled
>> mirror does have a return value of 0 and so would not end up in that
>> function.
>> But job_cancel() invokes job_completed_txn_abort() if the job has been
>> deferred to the main loop, which mostly correlates with the job having
>> been completed (in which case the assertion is skipped), but not 100 %
>> (there’s a small window between setting deferred_to_main_loop and the
>> job changing to a completed state).
>> So I’d prefer to keep the assertion as-is functionally, i.e. to only
>> check job->cancelled.
> Well, you don't. It's still job_is_cancelled() after this patch.

No: I didn’t. O:)

For v3, I had absolutely planned to use job_cancel_requested(), and I 
wanted to put the above explanation into the commit message.

> So the scenario you're concerned about is a job that has just finished
> successfully (job->ret = 0) and then gets a cancel request?

Yes.

> With force=false, I'm pretty sure the code is wrong anyway because
> calling job_completed_txn_abort() is not the right response.

Absolutely possible, I just didn’t want to deal with this, too… :/

> It should
> return an error because you're trying to complete twice, possibly with
> conflicting completion modes. Second best is just ignoring the cancel
> request because we obviously already fulfilled the request of completing
> the job (the completion mode might be different, though).
>
> With force=true, arguably still letting the job fail is correct.
> However, letting it fail involves more than just letting the transaction
> fail. We would have to call job_update_rc() as well so that instead of
> reporting success for the job, ECANCELED is returned and the job
> transitions to JOB_STATUS_ABORTING (after which job_is_completed()
> returns true).
>
> So, just bugs to be fixed.

Yep.  The question is, when/where; in this series or later?

> After this, I think we could even assert(job->ret != 0 ||
> job->status == JOB_STATUS_PENDING) in job_completed_txn_abort().
> ret == 0 can only happen when called from job_do_finalize(), when the
> job is only failing because other jobs in the same transaction have
> failed in their .prepare callback.

Sounds right.

>> Second, job_complete() refuses to let a job complete that has been
>> cancelled.  This function is basically only invoked by the user
>> (through qmp_block_job_complete()/qmp_job_complete(), or
>> job_complete_sync(), which comes from qemu-img), so I believe that it
>> should correspond to the external interface we have right now; i.e.,
>> if the user has invoked (block-)job-cancel at one point,
>> job_complete() should generally return an error.
> True. But it should also return an error if the user has invoked
> job-complete at some point. The distinction between complete and
> non-force cancel doesn't make sense there.

Yes, that’s true, it’s just that having double complete be a failure 
would be a change in behavior, and it would require another patch. Which 
is why I didn’t do it.

> And cancelling with force=false should fail, too, when either job-cancel
> or job-complete was called for the job before.

Yes.  At least for force=true, force=false is just a no-op, I believe.  
(.force_cancel is never reset to false.)

I’d like to defer this for the design series that Vladimir is planning 
to write, though.

>>> job_complete() is the only function outside of mirror that seems to use
>>> it. But even there, it feels wrong to make a difference. Either we
>>> accept redundant completion requests, or we don't. It doesn't really
>>> matter how the job reconfigures the graph on completion. (Also, I feel
>>> this should really have been part of the state machine, but I'm not sure
>>> if we want to touch it now...)
>> Well, yes, I don’t think it makes a difference because I don’t think
>> anyone will first tell the job via block-job-cancel to complete
>> without pivoting, and then change their mind and call
>> block-job-complete after all.  (Not least because that’s an error
>> pre-series.)
> Right, I'm just arguing that we shouldn't allow the opposite order
> either. Currently I think we do, and it's buggy, as explained above.

I agree.

>> Also, I’m not even sure whether completing after a soft cancel request
>> works.  I don’t think any of our code accounts for such a case, so I’d
>> rather avoid allowing it if there’s no need to allow it anyway.
> Yes, definitely avoid it. We should allow only one completion request
> (be it with job-complete or block-job-cancel) and return an error for
> all future completion requests for the same job.
>
> We could in theory keep allowing redundant completion requests when the
> completion mode doesn't conflict, but I don't see the point of that.

OK.  I personally think this need not be part of this series, though, so 
I’d like to defer it for now. O:)

(And since Vladimir is planning on turning soft cancel into a completion 
mode, I think we’re unlikely to forget about the problem.)

> Unless libvirt can actually issue multiple completion requests (again,
> this includes both (block-)job-complete and non-force block-job-cancel
> for mirror) for the same block job - Peter, I hope it doesn't?
>
> Kevin
>