[libvirt] [Discussion] How do we think about time out mechanism?

Tue Aug 5 07:15:18 UTC 2014

On 2014/8/4 19:59, Martin Kletzander wrote:

> On Sat, Jul 26, 2014 at 03:47:09PM +0800, James wrote:
>> On 2014/7/25 18:07, Martin Kletzander wrote:
>>
>>> On Fri, Jul 25, 2014 at 04:45:55PM +0800, James wrote:
>>>> When libvirtd is under heavy load, for example when we start many VMs at the same
>>>> time, some libvirt APIs can take a long time to return, which blocks the upper-level
>>>> job from finishing. We usually can't wait forever, so we want a timeout mechanism
>>>> to help us out: when an API call takes longer than some limit, it returns a timeout
>>>> result and does some rollback.
>>>>
>>>> So my question is: is there a plan to provide a timeout mechanism, or a better
>>>> solution, for this kind of problem in the future? And when?
>>>>
>>>
>>> Is it only because there are not enough workers available?  If yes,
>>> then changing the limits in libvirtd.conf (both global and
>>> per-connection) might be the easiest way to go.
>>>
>>> Martin
>>
>>
>> Thank you for the quick reply.
>>
>> Load is only one reason for wanting a timeout mechanism. If something really bad happens,
>> such as a rare bug that stops a libvirt API from ever returning, what can we do to make
>> sure the job is not blocked by the blocked API?
>>
>> For example, process A calls libvirt API b, but b never returns, so A is blocked there
>> forever. What is the best thing for us to do?
>>
> 
> As that is a pretty rare case that cannot be dealt with inside the API
> (since the API is the place where it gets locked), it has to be dealt
> with outside it.  I guess whatever you would do by hand is OK.  If,
> for example, you are used to restarting libvirtd after the block is
> detected, then restart it and try again.  You can spawn another
> process that will do it if you want some fine-grained control, or you
> can use client-side (and server-side) keepalive to be automatically
> disconnected in case the block happens inside the event loop (but it
> won't catch it outside).  I'm not sure how to answer more properly
> since this is not libvirt-specific.  If there's something
> libvirt-specific I missed, let me know.
> 
> Martin

Thanks.

In fact, to deal with this kind of situation, we added some timeout code to libvirtd, in the remote_dispatch path.
The mechanism works like this:
1. When an API is called, we start a timer thread. When the timer expires, it sets a timeout flag on the API call
   and returns a timeout result to the libvirt client.
2. When the API returns to the remote_dispatch level, it checks the timeout flag to decide what to do next.
   If the call has timed out, we perform a rollback action. For example, we detach the device if the API attached one.
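
The two steps above can be sketched roughly as follows. This is only a Python illustration of the idea, not the actual libvirtd code (which is C); the names `dispatch`, `Call`, `api` and `rollback` are hypothetical and exist only for this sketch.

```python
import threading
import time


class Call:
    """Per-call state shared by the timer and the worker (hypothetical)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.timed_out = False   # set by the timer when the deadline passes
        self.finished = False    # set by the worker when the API returns
        self.reply = None
        self.replied = threading.Event()


def dispatch(api, rollback, timeout):
    """Run api(); reply "timeout" if it takes longer than `timeout` seconds.

    If the API returns after the timeout reply was already sent, the
    worker sees the flag and runs rollback() instead of replying.
    """
    call = Call()

    def on_timeout():
        # Step 1: the timer sets the flag and answers the client.
        with call.lock:
            if call.finished:
                return               # the API won the race, nothing to do
            call.timed_out = True
            call.reply = "timeout"
        call.replied.set()

    def worker():
        result = api()
        # Step 2: back at the dispatch level, check the timeout flag.
        with call.lock:
            call.finished = True
            timed_out = call.timed_out
            if not timed_out:
                call.reply = result
        timer.cancel()
        if timed_out:
            rollback()               # e.g. detach the device we attached
        else:
            call.replied.set()

    timer = threading.Timer(timeout, on_timeout)
    timer.start()
    thread = threading.Thread(target=worker)
    thread.start()
    call.replied.wait()
    return call.reply, thread
```

Note that the client only gets an early answer; the worker thread stays blocked until the API really returns, and if it never returns the rollback never runs either.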

There are some problems with this solution. First, we have to figure out a suitable rollback action for each API.
Second, I'm not sure it's the best way to solve this kind of blocking problem; it is not very elegant.
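
For comparison, the outside-the-API approach suggested earlier in the thread (do the call in a separate process and give up on it after a deadline) could look roughly like the sketch below. The helper name `call_with_timeout` is hypothetical, and the command to run is left to the caller; it could be any client program that makes the libvirt call.

```python
import subprocess
import sys


def call_with_timeout(argv, timeout):
    """Run a client command in a child process with a deadline.

    Returns ("ok" | "error" | "timeout", captured stdout or None).
    subprocess.run() kills the child itself when the timeout expires.
    """
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Only the client side is abandoned here; whatever the daemon
        # already did still has to be checked or cleaned up separately.
        return "timeout", None
    status = "ok" if done.returncode == 0 else "error"
    return status, done.stdout


if __name__ == "__main__":
    # Stand-in for a blocked call: a child that sleeps past the deadline.
    blocked = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(call_with_timeout(blocked, 0.5)[0])
```

This keeps process A from blocking forever, but it has the same open question as our approach: deciding what to roll back after a timeout.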

What do you think about it?

-- 
Best Regards

James