[libvirt] [PATCH v2] network: make openvswitch call timeout compile time configurable

Michal Privoznik mprivozn at redhat.com
Wed Jan 25 09:05:35 UTC 2017


On 01/25/2017 09:05 AM, Boris Fiuczynski wrote:
> On 01/25/2017 04:16 AM, Laine Stump wrote:
>> On 01/24/2017 10:53 AM, Boris Fiuczynski wrote:
>>> Since successful completion of the calls to openvswitch is expected,
>>> a long timeout should be chosen to account for heavily loaded systems.
>>> Therefore this patch increases the default timeout value from 5 to
>>> 120 seconds and also allows the openvswitch timeout value to be set
>>> by specifying with-ovs-timeout when running configure.
>>
>> Why make it configurable during build? I don't think we do this with any
>> other type of timeout value or limit. If you think it may need to change
>> based on circumstances, why not just put it in libvirtd.conf and be done
>> with it?
>>
>> In the meantime, I agree with Michal that any machine that takes 120
>> seconds to get a response from any ovs command is beyond the limits of
>> usable; we certainly shouldn't cater our defaults to that.
>>
> The first version of the patch was sent in November last year,
> hard-coding the default value, which resulted in this response:
> https://www.redhat.com/archives/libvir-list/2016-November/msg01063.html
> That is why I created the current proposal. Certainly, allowing the ovs
> timeout to be specified in libvirtd.conf allows much more flexibility
> than the current patch provides.

Ah. Frankly, I like Laine's idea more: make this configurable somewhere
in a config file. libvirtd.conf sounds good.
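
Just to sketch what I mean (the setting name and value below are
invented purely for the sake of the example, no such knob exists yet):

  # /etc/libvirt/libvirtd.conf
  #
  # Timeout in seconds that libvirtd would pass to ovs-vsctl when
  # adding or removing ports. Admins of heavily loaded hosts could
  # raise it here without rebuilding libvirt.
  #ovs_timeout = 5

That would give distros and admins the flexibility without touching
configure at all.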

> 
> The system we saw ovs timeout problems on had 128 CPUs with a load avg
> of 74 and showed 41% idle. I would not call that an unreasonable load
> level on a system

Neither would I. But it is very suspicious that it still takes ovs more
than 5 seconds to create an interface, especially if roughly half of
those 128 cores are idling. Probably an ovs bug?
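
(For context: the 5 seconds is simply the value we currently hard-code
on the ovs-vsctl command line. What libvirtd runs when plugging a port
boils down to roughly the following, simplified here with placeholder
bridge/interface names and the external-ids trimmed:

  ovs-vsctl --timeout=5 -- add-port br0 vnet0 \
      -- set Interface vnet0 external-ids:iface-id=<guest iface id>

If I read the ovs-vsctl man page right, it waits forever without
--timeout, so whatever value we pass there is the only thing bounding
the call.)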

> and I also would not expect to get
> errors like:
> error: Disconnected from qemu:///system due to keepalive timeout
> error: Failed to start domain zs93k1g80002
> error: internal error: connection closed due to keepalive timeout
> when trying to start a domain, which are caused by the ovs command
> timing out. Please notice that the virsh start command itself, for the
> domain, did exceed the keepalive time limit.

Hold on, in the case of 'virsh start zs93k1g80002' it's libvirtd that is
starting the domain, not virsh. The keepalive messages are processed in
a different thread than the one starting the domain. Therefore it's not
an ovs bug; your machine is simply under such heavy load that not even
libvirtd can reply to keepalive pings. And thus the connection is closed,
which results in the domain startup process being cancelled.
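
Just so we are talking about the same knobs: the keepalive that fires
here can be tuned on both ends (the values below are only examples). On
the virsh side:

  # stretch the client keepalive while starting the domain
  virsh --keepalive-interval 30 --keepalive-count 10 start zs93k1g80002
  # or disable the client keepalive entirely
  virsh --keepalive-interval 0 start zs93k1g80002

and on the daemon side libvirtd.conf has keepalive_interval and
keepalive_count (5 seconds and 5 probes by default, if I remember
correctly). But that only papers over the real problem, which is the
load itself.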

> In addition, a later retry with the same amount of load on the system
> succeeded in starting the domain instead of running into the previous
> error caused by the ovs timeout when the ovs port is created.
> 
> 

That's weird. Is it possible that this is a kernel bug? Or maybe
something else odd is happening there.

Michal



