[libvirt] [RFC][scale] new API for querying domains stats

Michal Privoznik mprivozn at redhat.com
Tue Jul 1 09:47:32 UTC 2014


On 01.07.2014 11:33, Daniel P. Berrange wrote:
> On Tue, Jul 01, 2014 at 11:19:04AM +0200, Michal Privoznik wrote:
>> On 01.07.2014 09:09, Francesco Romani wrote:
>>> Hi everyone,
>>>
>>> I'd like to discuss possible APIs and plans for new query APIs in libvirt.
>>>
>>> I'm one of the oVirt (http://www.ovirt.org) developers, and I write code for VDSM;
>>> VDSM is the node management daemon, which is in charge, among many other things, of
>>> gathering host statistics and per-Domain/VM statistics.
>>>
>>> Right now we aim for a number of VMs per node in the (few) hundreds, but we have big plans
>>> to scale much further, and possibly reach thousands in the not so distant future.
>>> At the moment, we use one thread per VM to gather the VM stats (CPU, network, disk),
>>> and this obviously scales poorly.
>>
>> I think this is your main problem. Why not have only one thread that would
>> manage a list of domains to query and issue the APIs periodically, instead of
>> having one thread per domain?
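
For concreteness, a single polling thread along those lines might look
roughly like this with libvirt-python (a minimal sketch; the one-second
interval and the use of dom.info() as a stand-in for the full set of
stats calls are assumptions, not anything proposed in this thread):

  import time
  import libvirt

  # One polling thread for all domains: a single loop walks every
  # active domain once per interval instead of dedicating a thread
  # to each VM.
  INTERVAL = 1.0  # seconds; an assumed value

  conn = libvirt.open("qemu:///system")
  while True:
      start = time.time()
      for dom in conn.listAllDomains(
              libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
          try:
              state, maxmem, mem, ncpu, cputime = dom.info()
          except libvirt.libvirtError:
              continue  # domain may have gone away since the listing
      time.sleep(max(0.0, INTERVAL - (time.time() - start)))

Note that every pass of the inner loop still pays a full RPC round trip
per domain, which is exactly the objection raised below.
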
>
> You suffer from round trip time on every API call if you serialize it all
> in a single thread. E.g. if every API call is 50ms and you want to check
> once per second, you can only monitor 20 VMs before you take more time than
> you have available. This really sucks when the majority of that 50ms is a
> sleep in poll() waiting for the RPC response.

Unless you have a bulk query API, which pays the RTT only once ;)
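
A bulk call of this kind might be consumed from Python roughly as below
(a sketch assuming an entry point along the lines of what later became
virConnectGetAllDomainStats, which did not exist when this mail was
written; the stat groups are chosen purely as an example):

  import libvirt

  # One RPC round trip returns stats for every domain, instead of
  # several calls per domain per polling cycle.
  conn = libvirt.open("qemu:///system")
  groups = (libvirt.VIR_DOMAIN_STATS_CPU_TOTAL |
            libvirt.VIR_DOMAIN_STATS_BLOCK |
            libvirt.VIR_DOMAIN_STATS_INTERFACE |
            libvirt.VIR_DOMAIN_STATS_BALLOON)

  for dom, stats in conn.getAllDomainStats(groups):
      # 'stats' is a flat dict, e.g. stats["cpu.time"],
      # stats["block.0.rd.bytes"]
      print("%s cpu.time=%s" % (dom.name(), stats.get("cpu.time")))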

>
>>> This is made only worse by the fact that VDSM is a python 2.7 application, and notoriously
>>> python 2.x behaves very badly with threads. We are already working to improve our code,
>>> but I'd like to bring the discussion here and see if and when the querying API can be improved.
>>>
>>> We currently use these APIs for our sampling:
>>>    virDomainBlockInfo
>>>    virDomainGetInfo
>>>    virDomainGetCPUStats
>>>    virDomainBlockStats
>>>    virDomainBlockStatsFlags
>>>    virDomainInterfaceStats
>>>    virDomainGetVcpusFlags
>>>    virDomainGetMetadata
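
For illustration, one sampling pass over a single domain with those
calls looks roughly like this in libvirt-python (a sketch; the disk and
NIC names are placeholders, not VDSM's actual configuration):

  import libvirt

  def sample_domain(dom, disk="vda", nic="vnet0"):
      # One sampling pass using the calls listed above; device names
      # are placeholders.
      sample = {}
      sample["info"]    = dom.info()                # virDomainGetInfo
      sample["blkinfo"] = dom.blockInfo(disk)       # virDomainBlockInfo
      sample["cpu"]     = dom.getCPUStats(True)     # virDomainGetCPUStats (totals)
      sample["blk"]     = dom.blockStats(disk)      # virDomainBlockStats
      sample["blkf"]    = dom.blockStatsFlags(disk) # virDomainBlockStatsFlags
      sample["net"]     = dom.interfaceStats(nic)   # virDomainInterfaceStats
      sample["vcpus"]   = dom.vcpusFlags()          # virDomainGetVcpusFlags
      # virDomainGetMetadata; raises if no description is set
      sample["meta"]    = dom.metadata(
          libvirt.VIR_DOMAIN_METADATA_DESCRIPTION, None)
      return sample
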
>>>
>>> What we'd like to have is
>>>
>>> * asynchronous APIs for querying domain stats (https://bugzilla.redhat.com/show_bug.cgi?id=1113106)
>>>    This would be just awesome. Either a single callback or a different one per call is fine
>>>    (let's discuss this!).
>>>    Please note that we are much more concerned about thread reduction than about performance
>>>    numbers. We have had reports of the thread count becoming a real problem, while performance
>>>    so far is not yet a concern (https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c54)
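
To make the request concrete, a callback-style API might be consumed
roughly as sketched below. Everything here is hypothetical: the
getStatsAsync() method and its arguments are invented names for
illustration and do not exist in libvirt.

  import libvirt

  conn = libvirt.open("qemu:///system")
  results = {}

  def stats_cb(dom, stats, opaque):
      # Would run from the libvirt event loop; under CPython it still
      # has to take the GIL while it executes.
      opaque[dom.UUIDString()] = stats

  for dom in conn.listAllDomains():
      dom.getStatsAsync(stats_cb, results)   # hypothetical call
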
>>
>> I'm not a big fan of this approach. I mean, IIRC python has this Global
>> Interpreter Lock (GIL), which effectively prevents two threads from running
>> concurrently. So while in C this would make perfect sense, it doesn't do so
>> in python. The callbacks would be called from the event loop, which, given
>> how frequently you dump the info, will block other threads. Therefore I'm
>> afraid the approach would not bring any speed up, rather a slow down.
>
> I'm not sure I agree with your assessment here. If we consider a single
> API call, the time this takes to complete is made up of a number of parts
>
>   1. Time to write() the RPC call to the socket
>   2. Time for libvirtd to process the RPC call
>   3. Time to recv() the RPC reply from the socket
>
> With the calls serialized in a single thread, that same sequence simply
> repeats for every VM you monitor:
>
>   1. Time to write() the RPC call to the socket
>   2. Time for libvirtd to process the RPC call
>   3. Time to recv() the RPC reply from the socket
>
>   1. Time to write() the RPC call to the socket
>   2. Time for libvirtd to process the RPC call
>   3. Time to recv() the RPC reply from the socket
>   ...and so on..
>
> If the time for item 2 dominates over the time for items 1 & 3 (which
> it should really) then the client thread is going to be sleeping in a
> poll() for the bulk of the duration of the libvirt API call. If we had
> an async API mechanism, then the VDSM time would essentially be consumed
> with
>
>   1. Time to write() the RPC call to the socket
>   2. Time to write() the RPC call to the socket
>   3. Time to write() the RPC call to the socket
>   4. Time to write() the RPC call to the socket
>   5. Time to write() the RPC call to the socket
>   6. Time to write() the RPC call to the socket
>   7. wait for replies to start arriving
>   8. Time to recv() the RPC reply from the socket
>   9. Time to recv() the RPC reply from the socket
>   10. Time to recv() the RPC reply from the socket
>   11. Time to recv() the RPC reply from the socket
>   12. Time to recv() the RPC reply from the socket
>   13. Time to recv() the RPC reply from the socket
>   14. Time to recv() the RPC reply from the socket
>

Well, in the async form you also need to account for the time spent in
the callbacks:

1. write(serial=1, ...)
2. write(serial=2, ...)
..
7. wait for replies
8. recv(serial=x1, ...)   // there's no guarantee on order of replies
9. callback(serial=x1, ...)
10. recv(serial=x2, ...)
11. callback(serial=x2, ....)

And it's the callback times I'm worried about. I'm not saying we should
not add the callback APIs. What I'm really saying is that I doubt they
will help Python apps. They will definitely help C applications scale,
though.
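
A crude back-of-the-envelope model (all numbers invented for
illustration, not measurements) shows both effects at once: pipelining
removes the per-call wait on libvirtd, while per-reply callback time
becomes the new bound for a Python client:

  # Serialized vs. pipelined polling of N domains; invented numbers.
  N = 100            # domains to poll per cycle
  t_write = 0.1e-3   # write() one RPC call
  t_server = 45e-3   # libvirtd processing per call
  t_recv = 0.1e-3    # recv() one RPC reply
  t_callback = 2e-3  # Python callback per reply (GIL-bound)

  # Synchronous, one thread: every call pays the full round trip.
  sync_total = N * (t_write + t_server + t_recv)                # ~4.5 s

  # Pipelined, one thread: writes go out back to back, libvirtd's
  # processing of later calls overlaps with earlier replies, so the
  # client pays roughly one server latency plus per-reply handling.
  async_total = N * (t_write + t_recv + t_callback) + t_server  # ~0.27 s

  print("sync %.2fs  async %.2fs" % (sync_total, async_total))

If t_callback grows, the pipelined total creeps back up towards the
serialized one, which is the caveat above about Python callbacks.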

> Of course there's a limit to how many outstanding async calls you can
> make before the event loop gets 100% busy processing the responses,
> but I don't think that makes async calls worthless. Even if we had the
> bulk list API calls, async calling would be useful, because it would
> let VDSM fire off requests for disk, net, cpu, mem stats in parallel
> from a single thread.
>
> Regards,
> Daniel
>

Michal



