RX throttling causes keep-alive timeout

Ivan Teterevkov ivan.teterevkov at nutanix.com
Thu Aug 5 16:00:21 UTC 2021


On Wed, Aug 04, 2021 at 17:56:31 +0100, Daniel P. Berrangé wrote:
> There is no way to parse the keepalives, without also pulling all
> the preceding data off the socket, which defeats the purpose of
> having the limit.

Thanks for the comments, Daniel.

Likewise, I see no easy way to handle keep-alives independently of the
request limits, given that the throttling is applied at the socket
level.
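
For context, a peer declares the connection dead after roughly
keepalive_interval * keepalive_count seconds without a keep-alive
response, so with the stock libvirtd.conf defaults that is about 25
seconds of throttled RX:

    # /etc/libvirt/libvirtd.conf (stock defaults shown)
    keepalive_interval = 5   # seconds between keep-alive probes
    keepalive_count = 5      # unanswered probes before dropping the link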

> IMHO this is a tuning problem for your application. If you are expecting
> to have 5 long running operations happening concurrently in normal usage,
> then you should have increased the max_client_requests parameter to a
> value greater than 5 to give yourself more headroom.
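
For reference, that knob lives in the daemon config; the value below is
purely illustrative:

    # /etc/libvirt/libvirtd.conf
    max_client_requests = 20   # stock default is 5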

While it's down to application tuning, shall we consider reporting the
throttling in the server logs, since it's quite an abnormal situation?
I'm thinking of sending this or a similar patch to libvir-list:

diff --git a/src/rpc/virnetserverclient.c b/src/rpc/virnetserverclient.c
index 236702ced6..2bed81a82f 100644
--- a/src/rpc/virnetserverclient.c
+++ b/src/rpc/virnetserverclient.c
@@ -1296,6 +1296,8 @@ static virNetMessage *virNetServerClientDispatchRead(virNetServerClient *client)
                 client->rx->buffer = g_new0(char, client->rx->bufferLength);
                 client->nrequests++;
             }
+        } else {
+            VIR_WARN("RPC throttling, check max_client_requests");
         }
         virNetServerClientUpdateEvent(client);
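
With the patch applied, the throttling would surface in the daemon log
as something along these lines (format approximate):

    warning : virNetServerClientDispatchRead:1299 : RPC throttling, check max_client_requests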

Seeing this warning would point at the misconfiguration and help
anticipate latency-related issues, e.g. the keep-alive timeouts in this
case.

> The right limit is hard to suggest without knowing more about your mgmt
> application. As an example though, if your application can potentially
> have 2 API calls pending per running VM and your host capacity allows
> for 100 VMs, then you might plan for your max_client_requests value
> to be 200. Having a big value for max_client_requests is not inherently
> a bad thing - we just want to prevent unbounded growth when things go
> wrong.

Thanks for the example. The management application where I'm observing
the issue is quite conservative and applies its own request-level
throttling on the client side so as not to overwhelm the libvirtd
daemon. For now I plan to double the default limit and add guardrails,
e.g. the warning above, so we can revisit the configuration should the
issue occur again.
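
Concretely, that means max_client_requests = 10, double the stock
default of 5. For illustration, the client-side request-level
throttling mentioned above amounts to a pattern like the sketch below
(the semaphore approach and names are mine, not the application's
actual code):

    /* Cap the number of in-flight libvirt RPCs on one connection so
     * the daemon's max_client_requests limit is never reached. */
    #include <semaphore.h>
    #include <libvirt/libvirt.h>

    #define MAX_INFLIGHT 4          /* keep below max_client_requests */

    static sem_t inflight;  /* sem_init(&inflight, 0, MAX_INFLIGHT) at startup */

    static int
    throttledDomainCreate(virDomainPtr dom)
    {
        int rc;
        sem_wait(&inflight);        /* block while MAX_INFLIGHT calls pend */
        rc = virDomainCreate(dom);  /* any potentially long-running call */
        sem_post(&inflight);
        return rc;
    }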

Thanks,
Ivan
