[libvirt] [PATCH v2] Add API to change qemu agent response timeout

Tue Nov 12 16:00:47 UTC 2019

On 11/9/19 1:12 AM, Jonathon Jongsma wrote:
> Thanks for the thorough review Michal, clearly it needs more work. In
> particular, I was not yet aware of the private xml save/restore stuff.
> Thanks for the pointers in that regard. But I have one particular
> question below:
> 
> On Fri, 2019-11-08 at 11:46 +0100, Michal Privoznik wrote:
>> 1: It's not that simple. Firstly, you need to grab an agent job
>> (QEMU_AGENT_JOB_MODIFY ideally) to make sure no other thread is in
>> the
>> middle of talking to the agent. Secondly, some domains may not have
>> an
>> agent configured at all (in which case, priv->agent is going to be
>> NULL). And lastly, we need to track the timeout set in domain
>> private
>> data so that if libvirtd restarts the same timeout is preserved. But
>> changing private data means you have to acquire domain job too.
> 
> 
> Is this really true? I'm obviously still not intimately familiar with
> the libvirt 'job' stuff, but my understanding of jobs is that it is a
> method to avoid multiple threads talking to the qemu monitor (or agent,
> for agent jobs) at the same time. The data integrity of the structs
> themselves are protected by mutex. But a mutex is insufficient to
> mediate access to the monitor/agent because waiting for a response may
> involve a sleep, and the mutex will be released while the thread
> sleeps. So this 'job' system was added on top to indicate that the
> monitor/agent was in use and other threads need to wait until the job
> is complete before they can interact with the monitor or agent. Is my
> understanding correct so far?

Yes, more or less. You're right in assuming that mutex protects 
strucutres, and job is like a flag:

lock(struct);
struct->doNotModify = true;
unlock(struct);

... /* now any other thread sees doNotModify set => should avoid 
modifications */

lock(struct);
struct->doNotModify = false;
unlock(struct);

But one can also argue that every API should grab a job (we have both 
query and modify jobs available), because resulting code is future 
proof. I mean, if anybody ever adds a monitor/agent call somewhere, we 
won't enter monitor without job set (i.e. unlock domain object without 
doNotModify flag set in my example).

To put this into a perspective, imagine there is a thread that does two 
agent calls (e.g. creating a snapshot with --quiesce calls 'fsfreeze' 
before creating snapshot followed by 'fsthaw' once snapshot is created).
Since talking to an agent means releasing its lock, this thread might 
execute freeze and thaw with different timeouts (if your API is issued 
meanwhile). Perhaps in this simple case it's no problem and I just want 
to apply defensive programming with no real trade.
> 
> In this situation, we're not talking to the qemu monitor or the agent,
> we're merely changing some struct data. So from a data-integrity point
> of view, we simply need to hold the mutex lock, right? (We do hold the
> mutex lock for the domain from calling qemuDomainObjFromDomain()). So
> it doesn't seem strictly necessary to acquire a job here (either for
> the agent, or for the domain/monitor).
> 
> One argument I can see for acquiring a job here is that it ensures that
> any private data is not changed while another thread is in the middle
> of a monitor query. And that's potentially a compelling justification
> for acquiring a job. But in this circumstance, I don't see that it
> gains us much.
> 
> For example, let's say Thread A has already acquired an agent job for
> getHostname(). This thread will call qemuAgentCommand() and pass the
> value of mon->timeout for the 'seconds' argument. Inside this function,
> it will further call qemuAgentGuestSync(), which again uses the mon-
>> timeout value to determine how long it should wait for a response. It
> sends the guest-sync command to the guest agent, and then waits for a
> response (releasing the mutex). While it is waiting, let's say that
> Thread B calls SetResponseTimeout(). Let's further imagine that
> SetResponseTimeout() doesn't acquire a job (as in my patch). It can
> acquire the agent mutex (since the other thread released it while
> sleeping) and safely change the mon->timeout value immediately. At some
> point, Thread A wakes up, handles the guest-sync response and issues
> the 'guest-get-host-name' command to the agent. It uses the value
> passed as an argument to qemuAgentCommand() to determine how long to
> wait for a response from the agent. So it will not use the new value
> set by thread B, but will use the old value that was set when it was
> first called. (But I think an argument could be made that it would be
> fine to use this new timeout value as well.)
> 
> This behavior all seems fine to me. Am I misunderstanding something
> fundamentally here? Or is there just a general policy that all data
> modification should be protected by a job, regardless of whether we're
> using the qemu monitor or agent? If so, perhaps that could be better
> documented.
> 

Alright, let's use only locks. However, one hidden advantage of grabbing 
a job here is that EndJob() APIs automatically save domain status XML 
onto disk => if you update priv->agentTimeout in the API, the status XML 
is automatically written thus new value is safely stored on disk in case 
libvirtd restarts.

> One final concern: in your fixup commit, you acquire both a normal
> domain job and an agent job (in the case where the agent is active).
> But this is something I've been working to remove from libvirt after a
> discussion with Daniel Berrange in
> https://bugzilla.redhat.com/show_bug.cgi?id=1759566
> 

Yeah, well in this case we won't be really executing any guest agent 
API, merely mutually excluding with other threads that are talking to 
the agent. But let's use mutexes only.

Michal