[libvirt] Redesigning Libvirt: Better supporting non-hypervisor agnostic concepts

Daniel P. Berrange berrange at redhat.com
Tue Nov 14 17:25:03 UTC 2017


The problem(s)
==============

While a hypervisor agnostic API is useful for some users, it is completely
irrelevant, and potentially even painful, for other users. We made some
concessions to this when we introduced hypervisor specific XML namespaces
and the option of hypervisor specific add-on APIs. We tell apps these are all
unsupported for production usage though. IOW, have this pony, but you can
never play with it.

The hypervisor agnostic API approach inevitably took us in a direction where
libvirt (or something below it) is in charge of managing the QEMU process
lifecycle. We can't expose the concept of process management up to the client
application because many hypervisors don't present virtual machines as UNIX
processes, or merely have processes as a secondary concept, eg with Xen a QEMU
process is just subservient to the main Xen guest domain concept. Essentially
libvirt expects the application to treat the hypervisor / compute host as a
black box and just rely on libvirt APIs for managing the virtual machines,
because that is the only way to provide a hypervisor agnostic view of a
compute host. This approach also gives parity of functionality regardless of
whether the management app is on a remote machine, or colocated locally with
libvirtd. Most of the large scale management applications have ended up with a
model where they have a component on each compute host talking to libvirt locally
over a UNIX socket, with TCP based access only really used for live migration.
Thus the management apps have rarely considered the Linux OS to truly be a black
box when dealing with KVM. To some degree they all peer inside the box, and wish
to take advantage of some of the concepts Linux exposes to integrate with the
hypervisor.

The inability to directly associate a client with the lifecycle of a single
QEMU process has long been a source of frustration to libguestfs. The level
of indirection forced by use of libvirtd does not map well to how libguestfs
wants to use QEMU. Essentially libguestfs isn't trying to use QEMU in a
system management scenario, but rather to utilize it as an embedded technology
component. As a result, libguestfs still has its own non-libvirt based way
of spawning QEMU which is often used in preference to its libvirt based
implementation. Other apps like libvirt-sandbox have faced the same problem.

When systemd came into existence, finally providing good mechanisms for
process management on Linux machines, we found a tension between what libvirt
wants to do and what systemd wants to do. The best we've managed is
a compromise where libvirt spawns the guest, but then registers it with
systemd. Users can't directly spawn QEMU guests with systemd and then manage
them with libvirt. We've not seen people seriously try to manage QEMU guests
directly with systemd, but it is fair to say that the combination of systemd
and docker has taken away most potential users of libvirt's LXC driver, as apps
managing containers don't want to treat the host as a black box; they want to
have more direct control. The failure to get adoption of the LXC driver serves
as a cautionary tale for what could happen to use of the libvirt QEMU driver in
the future.

More recently the increasing interest in use of containers is giving rise to
interesting new architectures for the management of processes. In particular the
Kubernetes project can be considered to provide cluster-wide management of
processes, aka k8s is systemd for data centers. Again there is interest in
using Kubernetes to manage QEMU guests across the data center. The KubeVirt
project is attempting to bridge the conflicting world views of libvirt and
Kubernetes to build a KVM management system to eventually replace both oVirt
and OpenStack. The need to have libvirtd spawn the QEMU processes is causing
severe complications for the KubeVirt architecture, causing them to seriously
consider not using libvirt for managing KVM. This issue is a major blocking
item, to the extent that KubeVirt may well have to abandon use of libvirt
to get the process startup & resource integration model they need.

On Linux, as far as hypervisor technology is concerned, KVM has won the
battles and the war. OpenStack user surveys have consistently put KVM/QEMU
on top, with at least an order of magnitude higher usage than any other
technology. Amazon was always the major reference for usage of Xen in the
public cloud, and even they appear to be about to pivot to KVM.

IOW, while providing a hypervisor agnostic management API is still a core
competency of libvirt, we need to embrace the reality that KVM is the
de facto standard on Linux and better enable people to take advantage of its
unique features, because that is where most of our userbase is.

A second example of the limitations of the purely hypervisor agnostic approach
to libvirt is the way our API design is fully synchronous. An application
calling a libvirt API blocks until its execution is complete. This approach
was originally driven by the need to integrate directly with various Xen
backend APIs which were also mostly synchronous in design. Later we added
other hypervisor targets which also exposed synchronous APIs. In parallel
though, we added the libvirtd daemon for running stateful hypervisor drivers
like QEMU, LXC, UML, and now Xen. We speak to this over an RPC system that
can handle arbitrarily overlapping asynchronous requests, but we then force
it into our synchronous public API. For applications which only care about
using KVM, the ability to use an asynchronous API could be very interesting,
as it would no longer force them to spawn large numbers of threads to get
parallel API execution.
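
As a concrete illustration of that cost, here is a minimal sketch (using the
existing C API; the guest names are just placeholders) of the
thread-per-operation pattern an application is forced into today if it wants
to start several guests in parallel:

  #include <pthread.h>
  #include <stdio.h>
  #include <libvirt/libvirt.h>

  /* Each long running API call, eg virDomainCreate, blocks its calling
   * thread until the operation completes, so starting many guests in
   * parallel means one thread per guest on the application side. */
  static void *start_guest(void *opaque)
  {
      const char *name = opaque;             /* placeholder guest name */
      virConnectPtr conn = virConnectOpen("qemu:///system");
      if (!conn)
          return NULL;

      virDomainPtr dom = virDomainLookupByName(conn, name);
      if (dom) {
          if (virDomainCreate(dom) < 0)      /* blocks until the guest is running */
              fprintf(stderr, "failed to start %s\n", name);
          virDomainFree(dom);
      }
      virConnectClose(conn);
      return NULL;
  }

  int main(void)
  {
      const char *guests[] = { "guest1", "guest2", "guest3" };
      pthread_t threads[3];

      for (int i = 0; i < 3; i++)
          pthread_create(&threads[i], NULL, start_guest, (void *)guests[i]);
      for (int i = 0; i < 3; i++)
          pthread_join(threads[i], NULL);
      return 0;
  }

An asynchronous API would let a single thread issue all three requests and
collect the completions as they arrive, which is essentially what the RPC
layer already does internally.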


The Solution(s)
===============

Currently our long term public stability promise just covers the XML format and
the library API. To enable more interesting usage of hypervisor specific concepts
it is important to consider how to provide other options beyond just the current
API and XML formats. To be clear, I'm not talking about making QMP command
passthrough or CLI arg passthrough fully supported features, as libvirt's API &
XML abstraction has clear value there.

Rather I'm thinking about more architectural level changes. In particular I want
to try to break down the black box model of the host, to make it possible to
exploit KVM's key distinguishing feature, which is that the guest is just a
normal process. An application that knows how to spawn & reap processes should
be able to launch KVM as if it were just another normal process. This implies that
the application needs the option to handle the fork+exec of KVM, instead of
libvirt, if it so wishes.

I would anticipate a standalone process "libvirt-qemu" that an application can
spawn, providing a normal domain XML file via the command line or stdin. It would
then connect to libvirtd to register its existence and claim ownership of the
guest name + UUID. Assuming that succeeds, 'libvirt-qemu' would directly spawn
QEMU. In this manner, the QEMU process automatically inherits all the
characteristics of the application that invoked the 'libvirt-qemu' binary. This
means it shares the user / group ID, the security context, the cgroup placement,
the set of kernel namespaces, etc. Libvirt would honour these characteristics by
default, but also have the ability to further refine them. For example, it would
honour the initial process CPU pinning, but could still further pin individual
QEMU threads. In the initial implementation I would anticipate that libvirtd
still retains control over pretty much every other aspect of ongoing QEMU
management, ie libvirtd still owns the monitor connection. This means there would
be some assumptions / limitations in functionality in the short term, eg it might
be assumed that while libvirtd & libvirt-qemu can be in different mount
namespaces, they must none the less be able to see the same underlying storage
in their respective namespaces. The next mail in this series, however, takes
things further by moving actual driver functionality into libvirt-qemu, at which
point limitations around namespaces would be largely eliminated. This design
would solve the single biggest problem with managing QEMU from apps like
libguestfs, systemd and KubeVirt.
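
To make the intended calling convention more concrete, here is a rough sketch
of what the application side might look like. This is purely illustrative: the
"libvirt-qemu" binary is the proposal above, and the "--config" argument is an
invented placeholder, since no actual command line interface has been designed
yet.

  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>
  #include <stdio.h>

  int main(void)
  {
      /* The application owns the fork+exec, so the eventual QEMU process
       * inherits its uid/gid, security context, cgroup placement and
       * kernel namespaces without libvirtd needing to set them up. */
      pid_t child = fork();
      if (child < 0) {
          perror("fork");
          return 1;
      }
      if (child == 0) {
          execlp("libvirt-qemu", "libvirt-qemu",
                 "--config", "/path/to/domain.xml",  /* a normal domain XML */
                 (char *)NULL);
          perror("exec libvirt-qemu");
          _exit(127);
      }

      /* The shim registers the guest name + UUID with libvirtd and then
       * spawns QEMU; ongoing management (monitor, hotplug, etc) would still
       * go through the normal libvirt APIs against libvirtd. */
      int status;
      waitpid(child, &status, 0);
      return 0;
  }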

To avoid having 2 divergent launch processes, when libvirtd itself launches a
QEMU process, it would have to use the same "libvirt-qemu" shim to do so. This
would ensure functional equivalence regardless of whether the management app used
the hypervisor agnostic API, or instead used the QEMU specific approach of
running "libvirt-qemu". We made a crude attempt previously to allow apps to run
their own QEMU and have it managed by libvirt, via the virDomainQemuAttach API.
That API design was impossible to ever consider fully supported, because the mgmt
application was still in charge of constructing the QEMU command line arguments,
and it is impractical for libvirt to cope with an arbitrary set of args. With the
new proposal, we're still using the current libvirt code for converting XML into
QEMU args, so we have a predictable configuration for QEMU. Thus the new approach
can provide a fully supported way for applications to spawn QEMU.
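
For comparison, the old attach model looked roughly like the sketch below (the
API is real, from <libvirt/libvirt-qemu.h>, but the snippet is only meant to
show where the design went wrong): the application had already built and
launched QEMU with an arbitrary command line of its own, and merely handed
libvirt the PID to adopt, leaving libvirt to reverse engineer the configuration.

  #include <stdio.h>
  #include <stdlib.h>
  #include <libvirt/libvirt.h>
  #include <libvirt/libvirt-qemu.h>

  int main(int argc, char **argv)
  {
      if (argc != 2) {
          fprintf(stderr, "usage: %s <qemu-pid>\n", argv[0]);
          return 1;
      }

      virConnectPtr conn = virConnectOpen("qemu:///system");
      if (!conn)
          return 1;

      /* Adopt an already running, externally launched QEMU by PID. The
       * command line it was started with is entirely up to the app, which
       * is why this could never be a fully supported interface. */
      virDomainPtr dom = virDomainQemuAttach(conn, strtoul(argv[1], NULL, 10), 0);
      if (dom)
          virDomainFree(dom);

      virConnectClose(conn);
      return 0;
  }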

This concept of a "libvirt-qemu" shim is not all that far away from the current
"libvirt-lxc" shim we have. With this in mind, it would also be desirable to make
that a fully supported way to spawn LXC processes, which can then be managed by
libvirt. This would make the libvirt LXC driver more interesting for people who
wish to run containers (though it is admittedly too late to really recapture any
significant usage from other container technologies).

As mentioned earlier, if an application is only concerned with managing KVM
(or other stateful drivers running inside libvirtd), we have scope to be able to
expose a fully asynchronous management API to applications. Such an undertaking
would effectively mean creating an entirely new libvirt client library to expose
the asynchronous design, and we obviously have to keep the current library around
long term regardless. Creating a new library would also involve creating new
language bindings, which just adds to the work.  Rather than undertake this
massive amount of extra work, I think it is worth considering declaring the RPC
protocol to be a fully supported interface for applications to consume. There are
already projects which have re-implemented the libvirt client API directly on top
of the RPC protocol, bypassing libvirt.so. We have always strongly discouraged
this, but none the less it has happened. As we have to maintain strong protocol
compatibility on the RPC layer, it is effectively a stable API already. We cannot
ever change it in an incompatible manner without breaking our own client library
implementation. So declaring it a formally supported interface for libvirt would
not really involve any significant extra work on our part, just acknowledgement
of the existing reality. It would perhaps involve some documentation work to
assist developers wishing to consume it though. We would also have to outline the
caveats of taking such an approach, which principally involve losing the ability
to use the stateless hypervisor drivers which all live in the libvirt library.
This is not a real issue though, because the people building on top of the RPC
protocol only care about KVM.
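
For reference, the wire format being discussed is already quite small and
stable. A rough sketch of the per-packet header follows; the authoritative
definition is the XDR specification in src/rpc/virnetprotocol.x in the libvirt
source tree, so treat this as a paraphrase rather than a normative statement:

  #include <stdint.h>

  /* Every packet on the wire is a big-endian 4 byte total length, followed
   * by this XDR encoded header, followed by the XDR encoded arguments or
   * return value of the procedure. */
  struct sketch_rpc_header {
      uint32_t prog;    /* program ID, eg the remote driver program */
      uint32_t vers;    /* program version */
      int32_t  proc;    /* procedure number within the program */
      int32_t  type;    /* call, reply, async event or stream data */
      uint32_t serial;  /* client chosen serial echoed back in the reply;
                         * this is what lets requests overlap on one socket */
      int32_t  status;  /* ok, error, or continue (for streams) */
  };

The fact that the serial number is what correlates replies with calls is also
what would make an asynchronous client straightforward to layer directly on
this protocol.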

Another example where exposing a KVM specific model might help is wrt live
migration, specifically the initial launch of QEMU on the target host. Our
libvirt migration API doesn't give the application direct control over this
part, which has caused apps like OpenStack to jump through considerable hoops
when doing live migration. So just as an application should be able to launch
the initial QEMU process, it should be able to directly launch it ready for
incoming migration, and then trigger live migration to use this pre-launched
virtual machine.
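
A very rough sketch of how that split might look from the source host side,
assuming the destination application has already pre-launched QEMU via the
proposed shim (the "--incoming-migration" flag is invented purely for
illustration, and today's API does not yet expose this split, which is exactly
the gap being described):

  #include <libvirt/libvirt.h>

  /* Destination host (run by the application itself, hypothetically):
   *   libvirt-qemu --config domain.xml --incoming-migration
   * Source host: point the existing migration API at the destination,
   * expecting it to use the pre-launched, waiting QEMU rather than
   * spawning a fresh one. */
  int migrate_to_prelaunched(virDomainPtr dom, const char *dsturi)
  {
      unsigned long flags = VIR_MIGRATE_LIVE | VIR_MIGRATE_PEER2PEER;
      return virDomainMigrateToURI(dom, dsturi, flags, NULL, 0);
  }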

In general the concept is that although the primary libvirt.so API will still
consider the virt host to be a black box, below this, libvirt should not be
afraid to open up the black box to applications, exposing hypervisor specific
details as fully supported concepts. Applications can opt-in to using this,
or continue to solely use the hypervisor agnostic API, as best fits their
needs.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



