[libvirt] Ongoing work on lock contention in qemu driver?

Thu May 16 17:52:01 UTC 2013

On Thu, May 16, 2013 at 1:18 PM, Daniel P. Berrange <berrange at redhat.com> wrote:
> On Thu, May 16, 2013 at 01:00:15PM -0400, Peter Feiner wrote:
>> > How many CPU cores are you testing on ?  That's a good improvement,
>> > but I'd expect the improvement to be greater as # of core is larger.
>>
>> I'm testing on 12 Cores x 2 HT per code. As I'm working on teasing out
>> software bottlenecks, I'm intentionally running fewer tasks (20 parallel
>> creations) than the number of logical cores (24). The memory, disk and
>> network are also well over provisioned.
>>
>> > Also did you tune /etc/libvirt/libvirtd.conf at all ? By default we
>> > limit a single connection to only 5 RPC calls. Beyond that calls
>> > queue up, even if libvirtd is otherwise idle. OpenStack uses a
>> > single connection for everythin so will hit this. I suspect this
>> > would be why  virConnectGetLibVersion would appear to be slow. That
>> > API does absolutely nothing of any consequence, so the only reason
>> > I'd expect that to be slow is if you're hitting a libvirtd RPC
>> > limit causing the API to be queued up.
>>
>> I hadn't tuned libvirtd.conf at all. I have just increased
>> max_{clients,workers,requests,client_requests} to 50 and repeated my
>> experiment. As you expected, virtConnectGetLibVersion is now very fast.
>> Unfortunately, the median VM creation time didn't change.
>>
>> > I'm not actively doing anything in this area. Mostly because I've got not
>> > clear data on where any remaining bottlenecks are.
>>
>> Unless there are other parameters to tweak, I believe I'm still hitting a
>> bottleneck. Booting 1 VM vs booting 20 VMs in parallel, the times for libvirt
>> calls are
>>
>> virConnectDefineXML*: 13ms vs 4.5s
>> virDomainCreateWithFlags*: 1.8s vs 20s
>>
>> * I had said that virConnectDefineXML wasn't serialized in my first email. I
>>   based that observation on a single trace I looked at :-) In the average case,
>>   virConnectDefineXML is affected by a bottleneck.
>
> virConnectDefineXML would at least hit the possible bottleneck on
> the virDomainObjListAddLocked method. In fact that's pretty much
> the only contended lock I'd expect it to hit. Nothing else that
> it runs has any serious locking involved.

Okay cool, I'll measure this. I'll also try to figure out what
virDomainCreateWithFlags is waiting on.

>> Note that when I took these measurements, I also monitored CPU & disk
>> utilization.
>> During the 20 VM test, both CPU & disk were well below 100% for 97% of the test
>> (i.e., 60s test duration, measured utilization with atop using a 2
>> second interval,
>> CPU was pegged for 2s).
>>
>> > One theory I had was that the virDomainObjListSearchName method could
>> > be a bottleneck, becaue that acquires a lock on every single VM. This
>> > is invoked when starting a VM, when we call virDomainObjListAddLocked.
>> > I tried removing this locking though & didn't see any performance
>> > benefit, so never persued this further.  Before trying things like
>> > this again, I think we'd need to find a way to actually identify where
>> > the true bottlenecks are, rather than guesswork.
>>
>> Testing your hypothesis would be straightforward. I'll add some
>> instrumentation to
>> measure the time spent waiting for the locks and repeat my 20 VM experiment. Or,
>> if there's some systematic lock profiling in place, then I can turn
>> that on and report
>> the results.
>
> There's no lock profiling support built-in to libvirt. I'm not sure
> of the best way introduce such support without it impacting the very
> thing we're trying to test.  Suggestions welcome

A straightforward way to keep lock statistics with low overhead and
w/out affecting concurrency would be to use thread local storage
(TLS). At the end of a run, or periodically, the stats could be
aggregated and reported. Since the stats don't have to be precise,
it's OK to do the aggregation racily.

Simple statistics to keep are

    * For each lock L, the time spent waiting.
    * For each lock L and callsite C, the time spent waiting.

It would probably be sufficient to identify L as the lock's parent
class name. If per-instance stats are necessary, then we could add the
address of the object to the identity of L.

So pseudo code would look something like this:

struct lock_stats {
    map of (lock_class) to unsigned long: wait_time;
    map of (lock_class, stack_trace) to unsigned long: callsite_wait_time;
};

__thread struct lock_stats * lock_stats;

thread_local_storage_init() {
    lock_stats = new lock_stats;
}

/* return microseconds elapsed since some arbitrary start time */
unsigned long timestamp(void) {
    struct timespec timespec;
    clock_gettime(CLOCK_MONOTONIC, &timespec);
    return timespec.tv_sec * 1e6 + timespec.tv_sec / 10;
}

void virObjectLock(void *anyobj) {
    unsigned long start, elapsed;
    virObjectLockablePtr obj = anyobj;

    start = timestamp();
    virMutexLock(&obj->lock);
    elapsed = timestamp() - start;

    lock_stats->wait_time[obj->parent.klass->name] += elapsed
    lock_stats->wait_time[obj->parent.klass->name, get_stack_trace()] += elapsed
}

> Perhaps a systemtap script would do a reasonable job at it though.
> eg record any stack traces associated with long futex_wait() system
> calls or something like that.
>
> Regards,
> Daniel
> --
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|