[libvirt] [PATCH 0/4] improve virConnectListAllInterfaces()

Mon Sep 28 08:44:46 UTC 2015

On Fri, Sep 25, 2015 at 01:48:41PM -0400, Laine Stump wrote:
> On 09/25/2015 01:27 PM, Daniel P. Berrange wrote:
> >On Fri, Sep 25, 2015 at 05:22:30PM +0100, Daniel P. Berrange wrote:
> >>On Fri, Sep 25, 2015 at 11:13:52AM -0400, Laine Stump wrote:
> >>>There's a bit of background about this here:
> >>>
> >>>https://www.redhat.com/archives/augeas-devel/2015-September/msg00001.html
> >>>
> >>>In short, virt-manager is calling the virInterface APIs and that ties
> >>>up a libvirt thread (and CPU core) for a very long time on hosts that
> >>>have a large number of interfaces. These patches don't cure the
> >>>problem (I don't know that there really is a cure other than "Don't DO
> >>>that!"), but they do fix a couple of bugs I found while investigating,
> >>>and make a substantial improvement in the amount of time used by
> >>>virConnectListAllInterfaces().
> >>>
> >>>One thing that I wondered about while investigating this - a big use
> >>>of CPU by virConnectListAllInterfaces() comes from the need to
> >>>retrieve the MAC address of every interface. The MAC addresses are
> >>>both
> >>>
> >>>1) returned to the caller in the interface objects and
> >>>
> >>>2) sent to the policykit ACL checking to decide which interfaces to include in
> >>>the list.
> >>>
> >>>I'm 90% confident that
> >>>
> >>>1) most callers don't really care that they're getting the MAC address
> >>>along with interface name (virt-manager, for example, follows up with
> >>>a virInterfaceGetXMLDesc() anyway), and
> >>>
> >>>2) there is not even a single host *anywhere* that is using libvirt
> >>>policykit ACLs to limit the list of host interfaces viewable by a
> >>>client.
> >>>
> >>>So we could add a flag to not return MAC addresses, which would allow
> >>>cutting down the time to list all devices to something < 1
> >>>second. But this would be at the expense of no longer having the
> >>>possibility to limit the list with policykit according to MAC
> >>>address. Since all host interface information is available to all
> >>>users via the file system, for example, I don't see this as a huge
> >>>issue, but it would change behavior, so I don't feel comfortable doing
> >>>it. I don't like the idea of a single API call taking > 1 minute to
> >>>return either, though. Does anyone have an opinion about this?
> >>Ultimately 500 interfaces, each ifcfg-XXX file 300 bytes in size
> >>on average is only 150 KB of data. Given the amount of data we
> >>are consuming here, I think it is reasonable to expect we can
> >>process that in a tiny fraction of a second. So there's clearly
> >>a serious algorithmic / data structure flaw here if it's taking
> >>minutes.
> >>
> >>By the sounds of the thread you quote, it's in augeas itself, so I
> >>think we really need to focus on addressing that root cause as a
> >>priority rather than try to work around it.
> >>
> >>As a side note, we might consider adding a new API to netcf so that
> >>we can fetch the entire interface set + macs in one api call to
> >>netcf, though I doubt it'd change performance that much.
> >So, I instrumented the netcf and augeas code to check timings.
> 
> What did you use? I tried using perf and oprofile, but all I could get them
> to tell me was that a ton of time was being spent in strcmp(), so either it
> couldn't figure out who was the caller due to missing stack frame pointers,
> or I just didn't know the right commandline options. (The last time I did
> any serious profiling I used some custom code (written by someone else at a
> previous employer) that massaged xml format output from oprofile. A lot has
> changed since then.)

When I said "instrumented", what I meant is that I put gettimeofday()
calls either side of the function calls I thought were interesting/
suspicious, and then printf() the delta :-)
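
For anyone wanting to repeat this, the approach can be sketched roughly as
below. This is only a minimal illustration, not the actual instrumentation
that was added to netcf/augeas: elapsed_usec(), time_call() and
suspicious_call() are hypothetical names, and suspicious_call() is just a
stand-in for whichever real call is being measured.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Elapsed time between two gettimeofday() samples, in microseconds. */
static long long
elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    assert(start != NULL && end != NULL);
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
           + (long long)(end->tv_usec - start->tv_usec);
}

/* Hypothetical stand-in for the call under suspicion; it only burns
 * a little CPU so that there is something to measure. */
static void
suspicious_call(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        sink += i;
}

/* Sample the clock either side of fn(), print the delta and return it. */
static long long
time_call(const char *label, void (*fn)(void))
{
    struct timeval before, after;

    gettimeofday(&before, NULL);
    fn();
    gettimeofday(&after, NULL);

    long long delta = elapsed_usec(&before, &after);
    printf("%s: %lld usec\n", label, delta);
    return delta;
}
```

Calling time_call("suspicious_call", suspicious_call) prints one line per
measurement. gettimeofday() reads the wall clock, so a delta can be skewed
if the system time is adjusted mid-measurement; for quick ad-hoc timing
that is usually acceptable.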

> >The aug_get calls time less than a millisecond, as do the various
> >other calls. I found the bulk of the time is actually coming from
> >the netcf function "get_augeas", which in turn calls "aug_load"
> >for every single damn netcf function call.
> 
> I remember David Lutterkort talking about exactly that problem several years
> ago and *thought* I remembered that he had put something into augeas to only
> reread the files if they had changed. Has my memory failed me again? Or is
> augeas doing something and netcf just isn't taking advantage of it?
> 
> >Either we need to stop loading config files on every function
> >call in netcf, or we need to add a netcf bulk data API call,
> >so that libvirt can load /all/ the data it needs in 1 single
> >API call.
> 
> I much prefer (1) :-)

The main difficulty with doing (1) is that IIRC, the libvirtd daemon
holds its augeas connection open permanently, so we do need some way
to periodically refresh the interface data. We can't just do it when
a client calls virConnectOpen, because apps like openstack open a
single connection and then keep it open forever. I guess doing it on
every function call was the easy way to "solve" this.
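
One possible middle ground between "reload on every call" and "never
reload" is to reload only when the cached data is older than some
threshold. The sketch below is purely illustrative and assumes nothing
about how netcf or augeas are actually structured: struct refresh_state,
refresh_needed() and refresh_mark() are hypothetical names, not existing
API.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical cache bookkeeping; not part of the netcf or augeas API. */
struct refresh_state {
    bool loaded;        /* has the data been loaded at least once? */
    time_t last_load;   /* time of the most recent reload, in seconds */
    time_t max_age;     /* reload once the data is at least this old */
};

/* Return true if the cached data should be reloaded at time 'now'. */
static bool
refresh_needed(const struct refresh_state *st, time_t now)
{
    assert(st != NULL);
    return !st->loaded || (now - st->last_load) >= st->max_age;
}

/* Record that a reload was performed at time 'now'. */
static void
refresh_mark(struct refresh_state *st, time_t now)
{
    assert(st != NULL);
    st->loaded = true;
    st->last_load = now;
}
```

A caller would check refresh_needed(&st, time(NULL)) at the top of each
API entry point, do the expensive load only when it returns true, and then
call refresh_mark(). The obvious trade-off is that clients may see data
that is up to max_age seconds stale.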

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|