[libvirt] [RFC][scale] new API for querying domains stats

Tue Jul 1 09:19:04 UTC 2014

On 01.07.2014 09:09, Francesco Romani wrote:
> Hi everyone,
>
> I'd like to discuss possible APIs and plans for new query APIs in libvirt.
>
> I'm one of the oVirt (http://www.ovirt.org) developers, and I write code for VDSM;
> VDSM is the node management daemon, which is in charge, among many other things, of
> gathering host statistics and per-Domain/VM statistics.
>
> Right now we aim for a number of VMs per node in the (few) hundreds, but we have big plans
> to scale much further, and possibly to reach thousands in the not-so-distant future.
> At the moment, we use one thread per VM to gather the VM stats (CPU, network, disk),
> and this obviously scales poorly.

I think this is your main problem. Why not have a single thread that
manages a list of domains to query and calls the APIs periodically,
instead of having one thread per domain?
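
A minimal sketch of what I mean, assuming a hypothetical `sample` callable
that wraps whichever libvirt calls you need per domain. The class and all
names below are made up for illustration; this is not VDSM or libvirt code.

```python
import threading


class DomainSampler(object):
    """Poll every registered domain from one thread (hypothetical helper)."""

    def __init__(self, sample, interval):
        self._sample = sample        # callable(domain) -> stats (assumption)
        self._interval = interval    # seconds to wait between rounds
        self._domains = {}           # name -> domain object
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self.results = {}            # name -> latest stats, or the exception

    def add(self, name, domain):
        with self._lock:
            self._domains[name] = domain

    def remove(self, name):
        with self._lock:
            self._domains.pop(name, None)

    def run_once(self):
        # Copy the list under the lock so add()/remove() stay cheap.
        with self._lock:
            snapshot = list(self._domains.items())
        for name, domain in snapshot:
            try:
                self.results[name] = self._sample(domain)
            except Exception as exc:
                # One failing domain must not abort the whole round.
                self.results[name] = exc

    def run(self):
        # Body of the single sampling thread.
        while not self._stop.is_set():
            self.run_once()
            self._stop.wait(self._interval)

    def stop(self):
        self._stop.set()
```

A single `threading.Thread(target=sampler.run)` would then replace the
per-VM threads. Note that in this sketch a call that blocks on one
unresponsive domain stalls the whole round, so that case would need
separate handling.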

>
> This is made only worse by the fact that VDSM is a Python 2.7 application, and Python 2.x
> notoriously behaves very badly with threads. We are already working to improve our code,
> but I'd like to bring the discussion here and see if and when the querying API can be improved.
>
> We currently use these APIs for our sampling:
>    virDomainBlockInfo
>    virDomainGetInfo
>    virDomainGetCPUStats
>    virDomainBlockStats
>    virDomainBlockStatsFlags
>    virDomainInterfaceStats
>    virDomainGetVcpusFlags
>    virDomainGetMetadata
>
> What we'd like to have is
>
> * asynchronous APIs for querying domain stats (https://bugzilla.redhat.com/show_bug.cgi?id=1113106)
>    This would be just awesome. Either a single callback or a different one per call is fine
>    (let's discuss this!).
>    Please note that we are much more concerned about thread reduction than about performance
>    numbers. We have had reports of the thread count becoming a real problem, while performance
>    so far is not yet a concern (https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c54)

I'm not a big fan of this approach. IIRC, Python has the Global
Interpreter Lock (GIL), which effectively prevents two threads from
running Python code concurrently. So while this would make perfect
sense in C, it doesn't in Python. The callbacks would be called from
the event loop, which, given how frequently you dump the info, would
block other threads. Therefore I'm afraid the approach would not bring
any speedup, but rather a slowdown.

>
> * bulk APIs for querying domain stats (https://bugzilla.redhat.com/show_bug.cgi?id=1113116)
>    would be really welcome as well. It is quite independent from the previous bullet point
>    and would help us greatly with scale.

I think this one looks better, especially if you consider my suggestion
of having only one thread to serve all domains.
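
To illustrate why a bulk call fits the single-thread model: with N domains
and M statistics, the per-domain shape costs N*M calls per round, while a
bulk call costs one. In this sketch `get_all_stats` is a hypothetical
stand-in for the bulk API requested in the bug above, not an existing
libvirt call, and the getters and data are placeholders too.

```python
def per_domain_round(domains, getters):
    """Current shape: one call per domain per statistic."""
    stats, calls = {}, 0
    for name, dom in domains.items():
        stats[name] = {}
        for key, getter in getters.items():
            stats[name][key] = getter(dom)
            calls += 1
    return stats, calls


def bulk_round(get_all_stats):
    """Requested shape: a single call yields (name, stats) for every domain."""
    return dict(get_all_stats()), 1


# Placeholder data standing in for real domains and real stat getters.
domains = {"vm1": {"cpu": 10, "rx": 5}, "vm2": {"cpu": 20, "rx": 7}}
getters = {"cpu": lambda d: d["cpu"], "rx": lambda d: d["rx"]}

stats, calls = per_domain_round(domains, getters)
bulk_stats, bulk_calls = bulk_round(lambda: domains.items())
```

Both rounds produce the same data here; only the number of calls the
sampling thread has to make differs.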

>
> So, I'd like to discuss if these additions are (or can be) in the project roadmap,
> and, if so, what the API could look like and what the possible timeframe could be.
> Of course I'd be happy to provide any further information about VDSM and its workings.
>
> Thoughts very welcome!
>
> Thanks and best regards,
>

Michal