[libvirt] RFC: Death to the BQDL (Big QEMU Driver Lock)

Wed Dec 12 19:31:26 UTC 2012

On Wed, Dec 12, 2012 at 06:29:35PM +0100, Martin Kletzander wrote:
> On 12/12/2012 04:45 PM, Daniel P. Berrange wrote:
> > Many years ago when the QEMU driver was first written for libvirt the
> > daemon was single threaded, so we didn't have to worry about locking
> > at all. Then we introduced threads and so we had to have locking.
> > 
> > Since then locking has been done at two levels. The big QEMU driver
> > lock must be acquired first, then the per-VM lock could be acquired.
> > If no field in the driver struct was used, we could quickly
> > release the QEMU driver lock and only hold the per-VM lock.
> > 
> > Over time though, our code has got more & more complicated to the
> > point where we have to hold the big QEMU driver lock in the vast
> > majority of methods. We mitigate this by releasing the locks when
> > sleeping on monitor API calls, but this is still a huge bottleneck.
> > This is particularly apparent when you look at concurrency when
> > starting/stopping guests.
> > 
> > This scalability limitation is at a point where it is unacceptable
> > for us to continue as we do today.
> > 
> > Band-aids no longer suffice. To fix this for the long term, we need
> > to dramatically change the locking approach we use in the QEMU driver.
> > The only way we can achieve this is by dramatically changing the way
> > we update/access QEMU driver state.
> > 
> > The core problem is that the qemu driver struct holds a vast array
> > of (often unrelated) data, with differing access patterns. It is
> > clear that a single lock is not a suitable level of granularity for
> > this. Looking at what is in the struct we can classify data into a
> > number of buckets
> > 
> >  1. Read-only data that never changes for lifetime of libvirtd.
> >     eg
> > 
> >      - char *configDir
> >      - char *libDir
> >      - bool privileged;
> >      - const char *uri;
> >      - virThreadPoolPtr workerPool
> > 
> >  2. Read-only data that never changes for as long as the config
> >     file is not reloaded. (Currently equivalent to previous bucket
> >     since we don't support reloading qemu.conf - we need to support
> >     that in the future)
> >     eg
> > 
> >       - uid_t user;
> >       - uid_t group;
> >       - int dynamicOwnership;
> >       - unsigned int vncTLS :1;
> >       - unsigned int vncSASL :1;
> >       - char *vncListen;
> > 
> >  3. Read-write data that changes at arbitrary times
> >     eg
> > 
> >       - virDomainObjList domains
> >       - size_t nactive
> >       - pciDeviceList *activePciHostdevs;
> >       - usbDeviceList *activeUsbHostdevs;
> >       - virHashTablePtr closeCallbacks;
> > 
> > 
> > My proposal for dealing with things is as follows
> > 
> >  1. Read-only data that never changes for lifetime of libvirtd.
> > 
> >     Turn the current driver mutex into a driver RW-lock. All API
> >     calls will always acquire a read-lock at start, hold it for
> >     the lifetime of their execution and release it at the end.
> >     API calls will never directly acquire write locks.
> > 
> >     The QEMU driver startup / shutdown global initializers will
> >     acquire write-locks. This ensures the daemon can't shutdown
> >     while any APIs are being executed.
> > 
> > 
> >  2. Read-only data that never changes for as long as the config
> >     file is not reloaded.
> > 
> >     Move all of this data out into a new virQEMUDriverConfigPtr
> >     struct, which is an instance of virObject. The virQEMUDriverPtr
> >     will hold the primary reference to the config. The contents
> >     of this object struct will be considered immutable once
> >     initialized.
> > 
> >     When an API needs to access config file, it will obtain a
> >     reference on the config object. Obtaining the reference
> >     will involve acquiring & releasing the driver lock.
> > 
> >     If the QEMU driver needs to reload the config, it will populate
> >     a completely new virQEMUDriverConfigPtr instance, and unref
> >     the existing one.
> > 
> >     Thus access to data in virQEMUDriverConfigPtr can be completely
> >     lockless once an instance has been acquired, despite the possiblity
> >     of the config being updated at an arbitrary time.
> > 
> >     cf  RCU (read-copy-update)
> > 
> >  3. Read-write data that changes at arbitrary times.
> > 
> >     All data that can be changed must be stored in a dedicated
> >     virObject based instance. Each object must provide its own
> >     internal locking mechanisms targetted to the type of data
> >     being stored.
> > 
> >     Some objects may need some re-architecting to allow them to
> >     operate effectively without the protection of the long lived
> >     QEMU driver lock. For example during domain startup, we rely
> >     on the QEMU driver lock to protect against races between the
> >     time we check for an existing VM with (name,uuid), and the
> >     time we actually finish starting the new VM & store it in
> >     the domain list.  To deal with this the virDomainObjList
> >     will need to have some concept of a 'reserved name,uuid'
> >     so safety is ensured, despite not holding a lock for the whole
> >     start operation.
> > 
> > 
> > So I lied slightly when I said this was the death of the big QEMU
> > driver lock. The big QEMU driver lock still exists, but API calls
> > only ever need to have read-locks. Write-locks are only held for
> > libvirtd startup/shutdown, and for the tiny time window it takes
> > to grab a reference to a virQEMUDriverConfigPtr.
> > 
> > Access to config params is completely lockless, even allowing for
> > their live update.
> > 
> > All the remaining exclusive locks will be pushed down into individual
> > objects which need them, hopefully ensuring high concurrency of
> > operation.
> > 
> > 
> > Implementing this all is a non-trivial job, so I envisage the following
> > order of attack
> > 
> >  1. Create the virQEMUDriverConfigPtr object & move config file
> >     parameters into that.
> > 
> >  2. Encapsulte all read-writable state into objects with dedicated
> >     locking
> > 
> >  3. Turn QEMU driver mutex into a read-write lock
> > 
> >  4. Convert all APIs to only hold read-locks on QEMU driver.
> > 
> 
> I already have a working patch for virRWLock with its methods LockRead,
> Lock(Write) and Unlock, that not only replaces QEMU driver's virMutex,
> but is written so it can be used on more places (of course).  With this,
> an idea for new "locking virObject" came to my mind.  A possible
> "inherited class", let's say, that would have the capabilities of
> virObject with RWLock incorporated inside itself with universal
> lock-related methods.  Is that what you had in mind with the "object
> with dedicated locking"?  This would help having the lock on same
> "level" in methods related to the object itself.  For other use cases
> than QEMU driver, I made it as Win32-friendly as I could without being
> able to compile on that platform.

My intent is actually that virObject itself should always have
a virMutex associated with it. The majority of users need a mutex
and it isn't significantly harmful to allocate one even for those
which don't.

> Having each and every read-write property of the driver as an object
> with its own read-write lock (again not Mutex, see below) seems a bit
> cumbersome, though.  I thought splitting it in a logical hierarchy would
> be nicer.  This, of course, might not be possible for most of those
> properties (just an idea).

For the read-write properties I'm anticipating each one should be
properly encapsulated in an opaque class. Take for example, virBitmapPtr
which we use to store VNC port usage. We currently just set/get bits
there. The bitmap doesn't have locking itself & we don't wnat to add
locking to that, but if we create a virPortAllocator class and accessor
methods for acquiring & release ports, its locking can be completely
self-contained.

This is good because now the same class can be used in other drivers
like libxl, without them having to know or care about locking rules
in their driver.

The same principle applies to other read-write things like the domain
list, usb/pci device list, and pretty much everything else that can be
updated in qemu driver.

Not only do we get very fine grained locking by doing this, we actually
significantly simplify / clarify the QEMU driver code. As I described
above, API method impls would basically acquire a QEMU read lock at
start and release it at end. The only other lock it directly has to
care about is the virDomainObjPtr lock. This makes for very simple
rules for people writing code.

> I understand you meant it the same way as I, but just to be sure the
> idea is common, these objects should have read-write locks, not mutexes
> in them, right?  I'm asking, because some of the locks can be held for
> procedures that are most likely to be only reading (e.g. looking up a
> domain).

WRT read/write locks, I don't think we should use them in
virObject. While they are quite appealing when first looking
at them, IMHO, they are actually not very desirable, primarily
because of the problem of writer starvation. I also tend to
view RW locks as a bandaid for designs that could be better
served by using multiple plain mutexes, rather than one rwlock.

> As you said, one of the most apparent bottlenecks is parallel starting
> of QEMU domains.  This is even more troublesome (and apparent) when
> management applications start, for example, requesting statistics for
> each of those domains.  The speed can go down by orders of magnitude in
> such cases.
> 
> I've got two additional questions, if I may, just out of curiosity.
> 
> What category are you planing on keeping the capabilities in.  That
> virQEMUDriverConfigPtr or as one of the self-locking objects?

They would be self-locking, or better yet, lock-free. This can be achieved
if we can ensure that an instance is immutable once initialized. It think
we're actually pretty much there - there is an issue with -help based caps
detection where we have to probe some stuff at VM startup. We work around
that by cloning the qemuCapsPtr instance at VM start.

> I couldn't find out for sure, but it seems like POSIX thread's
> write-locks can be starved by read-locks, which we IMHO don't want since
> the majority of operations will be read-only.  Do you know what's the
> preference, and potentially, will we have to rewrite the locks to be
> greedy the other way around?

I think the fact that you're worried about writer starvation is a sign
that the locking model you're thinking about needs more work.

In the design I outlined, there would be a rw-lock for virQEMUDriverPtr,
but API calls would *only* ever acquire read locks. Write locks would
only be acquired in daemon startup/shutdown. So you never have a writer
starvation problem there.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|