[libvirt] [PATCH] daemon: Dynamically create worker threads when some get stuck

Daniel P. Berrange berrange at redhat.com
Fri Jun 17 12:26:06 UTC 2011


On Fri, Jun 17, 2011 at 12:46:33PM +0200, Jiri Denemark wrote:
> On Fri, Jun 17, 2011 at 10:55:43 +0100, Daniel P. Berrange wrote:
> > On Thu, Jun 16, 2011 at 04:03:36PM -0400, Dave Allan wrote:
> > > Dan, can you suggest some possible strategies here?  I don't have a
> > > strong opinion on the implementation, although I agree with your
> > > concern about spawning unlimited numbers of threads.  
> > 
> > As I mentioned, we need to make the QEMU monitor timeout after some
> > period of time waiting, and ensure that the monitor for that VM cannot
> > be used thereafter.
> 
> I'm not sure that's the best way to deal with this either. I have hated this
> kind of timeout since I worked on Xen :-) The problem with such a timeout is
> that no matter how big it is, it is usually pretty easy to get into a
> situation where it is not big enough. If anything in the system goes crazy
> (the easiest way is just causing lots of disk writes), the monitor command
> times out and you cannot do anything with the domain except destroy it (or
> shut it down from inside), even though you fixed the issue and the system
> returned to normal operation.

Depending on the commands being issued it would be possible to recover
after a monitor timeout. e.g. for a 'query-XXXX' command, we could likely
just discard any reply that arrives after the timeout. For things like
'cont' and 'stop' we can rely on the fact that QEMU notifies us via async
events when the VM actually pauses/resumes. Migration and hotplug/unplug
are the main problem areas, where you can end up in a state that is tricky
to recover from.
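To make that concrete, here's a rough sketch of the "discard late replies"
idea (purely illustrative; the struct and function names are invented, not
our actual monitor code): the waiting thread flags the command as timed
out, and the I/O loop throws away any answer that turns up afterwards:

  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct mon_cmd {
      unsigned long id;   /* matches a reply to its request */
      bool timed_out;     /* set once the waiting API call gives up */
      char *reply;        /* filled in when QEMU answers in time */
  };

  /* Called from the monitor I/O loop when a reply arrives. */
  static void
  mon_handle_reply(struct mon_cmd *cmd, const char *reply)
  {
      if (cmd->timed_out) {
          /* The API call already returned an error to its client;
           * consuming and dropping the late reply keeps the monitor
           * usable for later commands instead of wedging it. */
          fprintf(stderr, "discarding late reply for command %lu\n",
                  cmd->id);
          return;
      }
      cmd->reply = strdup(reply);
  }

  int main(void)
  {
      struct mon_cmd cmd = { .id = 1, .timed_out = false, .reply = NULL };

      cmd.timed_out = true;               /* the waiter hit its timeout */
      mon_handle_reply(&cmd, "{ ... }");  /* ...the reply arrives late */

      printf("reply stored: %s\n", cmd.reply ? cmd.reply : "(none)");
      free(cmd.reply);
      return 0;
  }

The important property is that the stale reply is still consumed, so the
reply stream stays in step with whatever command is issued next.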

> Another issue is that the threads don't have to be stuck in the QEMU monitor
> after all; they can be doing migration, for example. Let's say one client
> connects to libvirtd and starts 5 migrations. Then 15 other clients connect
> and each issues 1 additional migration. So we have 16 clients connected and
> all 20 worker threads consumed. Even though a new client can still connect
> to libvirtd, it can't do anything (not even cancel the migrations) since no
> worker is free. I know this is not a probable scenario, but I wanted to show
> that we need to think about more of the ways libvirtd can become
> unresponsive.

IMHO, the scenario you describe shows the configuration parameters operating
as intended: they limit the number of concurrent API requests that are
processed.
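To put some code behind that, here is a toy model (nothing like libvirtd's
real dispatcher; every name below is invented) in which a counting semaphore
plays the role of the worker limit. Once the long-running calls hold all the
slots, a late-arriving request, even a "cancel", can only wait its turn:

  #include <pthread.h>
  #include <semaphore.h>
  #include <stdio.h>
  #include <unistd.h>

  #define MAX_WORKERS 2            /* stands in for the worker limit */

  static sem_t slots;              /* counts free worker threads */

  /* A long-running RPC, e.g. a migration that takes a while. */
  static void *long_call(void *arg)
  {
      (void)arg;
      sem_wait(&slots);            /* claim a worker slot */
      sleep(3);                    /* the slow work itself */
      sem_post(&slots);            /* release the slot */
      return NULL;
  }

  int main(void)
  {
      pthread_t t[MAX_WORKERS];

      sem_init(&slots, 0, MAX_WORKERS);
      for (int i = 0; i < MAX_WORKERS; i++)
          pthread_create(&t[i], NULL, long_call, NULL);
      sleep(1);                    /* let the long calls take every slot */

      /* A new request arrives; with the pool exhausted it can only
       * wait, which is exactly the bound being enforced. */
      if (sem_trywait(&slots) != 0)
          printf("no worker free: request must wait for a slot\n");
      else
          sem_post(&slots);

      for (int i = 0; i < MAX_WORKERS; i++)
          pthread_join(t[i], NULL);
      sem_destroy(&slots);
      return 0;
  }

The bound is doing its job here; the real question is how an operator gets
out of that state without restarting the daemon, which is where the admin
interface below comes in.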

What I do think we need is an administrative interface to libvirtd. We ought
to be able to connect to libvirtd without opening a hypervisor connection,
and perform various admin operations. In particular I'd like to be able to
change logging filters + outputs on the fly & kill off active client
connections. We could also make it possible to increase the thread limits
on the fly. Adding an administrative interface, separate from the main RPC
interface, is one of the goals of the RPC rewrite patches I've posted many
times.
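To sketch the flavour of it (every declaration below is hypothetical; none
of this exists in libvirt today), such an interface might expose calls
along these lines:

  /* All names here are invented sketches, not an existing API. */
  typedef struct _virAdminConn virAdminConn;

  /* Connect to the daemon's admin socket without opening any
   * hypervisor driver connection. */
  virAdminConn *virAdminConnectOpen(const char *sockpath);
  void virAdminConnectClose(virAdminConn *conn);

  /* Change logging filters and outputs on the running daemon. */
  int virAdminSetLogFilters(virAdminConn *conn, const char *filters);
  int virAdminSetLogOutputs(virAdminConn *conn, const char *outputs);

  /* Forcibly close a misbehaving client connection. */
  int virAdminClientClose(virAdminConn *conn, unsigned long long clientID);

  /* Raise the worker thread limit without restarting libvirtd. */
  int virAdminSetMaxWorkers(virAdminConn *conn, unsigned int maxWorkers);

Because such a connection would bypass the normal worker pool, it would
stay usable even in the exhaustion scenario above.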

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|