[Libvir] PATCH 0/2: Support QEMU (+KVM) in libvirt

Fri Jan 5 21:14:30 UTC 2007

The following series of (2) patches adds a QEMU driver to libvirt. The first patch
provides a daemon for managing QEMU instances, the second provides a driver letting
libvirt manage QEMU via the daemon. 

Basic architecture
------------------

The reason for the daemon architecture is twofold:

 - At this time, there is no (practical) way to enumerate QEMU instances, or
   reliably connect to the monitor console of an existing process. There is
   also no way to determine the guest configuration associated with a running process.

 - It is desirable to be able to manage QEMU instances using either an unprivileged
   local client, or a remote client. The daemon can provide connectivity via UNIX
   domain sockets, or IPv4 / IPv6 and layer in suitable authentication / encryption
   via TLS and/or SASL protocols.

Anthony Liguori is working on patches for QEMU with the goal of addressing the
first point. For example, an extra command line argument will cause QEMU to save
a PID file and create a UNIX socket for its monitor at a well-defined path. More
functionality in the monitor console will allow the guest configuration to be
reverse engineered from a running guest. Even with those patches, however, it will
still be desirable to have a daemon to provide more flexible connectivity, and to
facilitate implementation of libvirt APIs which are host (rather than guest) related.
Thus I expect that over time we can simply enhance the daemon to take advantage of
newer capabilities in the QEMU monitor, but keep the same basic libvirt driver
architecture.

Considering some of the other hypervisor technologies out there, in particular 
User Mode Linux, and lhype, it may well become possible to let this QEMU daemon
also provide the management of these guests - allowing re-use of the single driver
backend in the libvirt client library itself.

XML format
----------

As discussed in the previous mail thread, the XML format for describing guests
with the QEMU backend is the same structure as that for Xen guests, with the
following enhancements:

  - The 'type' attribute on the top level <domain> tag can take one of the
    values 'qemu', 'kqemu' or 'kvm' instead of 'xen'. This selects between
    the different virtualization approaches QEMU can provide.

  - The <type> element within the <os> block of the XML (for now) is
    still expected to be 'hvm' (indicating full virtualization), although
    I'm trying to think of a better name, since it's not technically hardware
    accelerated unless you're using KVM.

  - The <type> element within the <os> block of the XML can have two
    optional 'arch' and 'machine' attributes. The former selects the CPU
    architecture to be emulated; the latter the specific machine to have
    QEMU emulate (determine those supported by QEMU using 'qemu -M ?').

  - The <kernel>, <initrd>, <cmdline> elements can be used to specify 
    an explicit kernel to boot off[1], otherwise it'll do a boot of the 
    cdrom, harddisk / floppy (based on <boot> element). Well, the kernel
    bits are parsed at least. I've not got around to using them when 
    building the QEMU argv yet.

  - The disk devices are configured in the same way as Xen HVM guests, e.g. you
    have to use hda -> hdd, and/or fda -> fdb. Only hdc can be selected
    as a cdrom device.

  - The network configuration is work in progress. QEMU has many ways to
    setup networking. I use the 'type' attribute to select between the
    different approaches ('user', 'tap', 'server', 'client', 'mcast'), mapping
    them directly onto QEMU command line arguments. You can specify a
    MAC address as usual too. I need to implement auto-generation of MAC
    addresses if omitted. Most of them have extra bits of metadata though
    which I've not figured out appropriate XML for yet. Thus when building
    the QEMU argv I currently just hardcode 'user' networking.

  - The QEMU binary is determined automatically based on the requested
    CPU architecture, defaulting to i686 if none is specified. It is possible
    to override the default binary using the <emulator> element within the
    <devices> section. This differs from what was previously discussed, because
    recent work by Anthony merging VMI + KVM to give paravirt guests means
    that the <loader> element is best kept to refer to the VMI ROM (or other
    ROM like files :-) - this is also closer to Xen semantics anyway.
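
To make the binary selection rules above concrete, here is a rough sketch in
Python (not the daemon's actual C code). The binary names for the non-KVM case
are my assumption for illustration only, and build_argv / guess_binary are
hypothetical helper names. It only covers the emulator path and the -M option.

```python
import xml.etree.ElementTree as ET

# Hypothetical defaults, for illustration only; the real daemon may differ.
DEFAULT_ARCH = "i686"
KVM_BINARY = "/usr/bin/qemu-kvm"  # default KVM binary path named in this mail


def guess_binary(domain_type, arch):
    """Pick an emulator binary name from the domain type and CPU arch."""
    if domain_type == "kvm":
        return KVM_BINARY
    if arch == "i686":
        return "qemu"                 # assumed name of the 32-bit x86 build
    return "qemu-system-" + arch      # assumed naming pattern for other arches


def build_argv(xml_text):
    """Build a partial QEMU argv from a libvirt-style domain description."""
    dom = ET.fromstring(xml_text)
    os_type = dom.find("os/type")
    arch = os_type.get("arch", DEFAULT_ARCH)
    machine = os_type.get("machine")

    # An explicit <emulator> under <devices> overrides the guessed binary.
    emulator = dom.findtext("devices/emulator")
    argv = [emulator or guess_binary(dom.get("type"), arch)]

    # 'machine' maps onto QEMU's -M option; omit it to get QEMU's default.
    if machine is not None:
        argv += ["-M", machine]
    return argv
```

Disks, networking and graphics are left out on purpose, since their mapping
onto the command line is still in flux.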

Connectivity
------------

The namespace under which all connection URIs come is 'qemud'. Thereafter
there are several options. First, two well-known local hypervisor
connections

  - qemud:///session

    This is a per-user private hypervisor connection. The libvirt daemon and
    qemu guest processes just run as whatever UNIX user your client app is 
    running. This lets unprivileged users use the qemu driver without needing 
    any kind of admin rights. Obviously you can't use KQEMU or KVM accelerators
    unless the /dev/ device node is chmod/chown'd to give you access.

    The communication goes over a UNIX domain socket which is mode 0600 created
    in the abstract namespace at $HOME/.qemud.d/sock.

  - qemud:///system

    This is a system-wide privileged hypervisor connection. There is only one
    of these on any given machine. The libvirt_qemud daemon would be started
    ahead of time (by an init script), possibly running as root, or maybe under
    a dedicated system user account (and the KQEMU/KVM devices chown'd to match).
    The admin would optionally also make it listen on IPv4/6 addrs to allow
    remote communication. (see next URI example)

    The local communication goes over one of two possible UNIX domain sockets,
    both in the abstract namespace under the directory /var/run. The first socket
    called 'qemud' is mode 0600, so only privileged apps (ie root) can access it,
    and gives full control capabilities. The other called 'qemud-ro' is mode 0666
    and any clients connecting to it will be restricted to only read-only libvirt 
    operations by the server.

  - qemud://hostname:port/

    This lets you connect to a daemon over IPv4 or IPv6. If omitted the port is
    8123 (will probably change it). This lets you connect to a system daemon
    on a remote host - assuming it was configured to listen on IPv4/6 interfaces.
    Currently there is zero auth or encryption, but I'm planning to make it 
    mandatory to use the TLS protocol - using the GNU TLS library. This will give
    encryption, and mutual authentication using either x509 certificates or
    PGP keys & trustdbs or perhaps both :-) Will probably start off by implementing
    PGP since I understand it better.

    So if you wanted to remotely manage a server, you'd copy the server's 
    certificate/public key to the client into a well known location. Similarly
    you'd generate a keypair for the client & copy its public key to the 
    server. Perhaps I'll allow clients without a key to connect in read-only
    mode. Need to prototype it first and then write up some ideas.
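
As a rough illustration of how a client could map the three URI forms above
onto a transport, here is a sketch in Python (the real driver is C).
parse_qemud_uri and its read_only flag are hypothetical names of mine; the
socket paths and the default port are the ones given above.

```python
import os
from urllib.parse import urlsplit

DEFAULT_PORT = 8123  # default stated above; it will probably change


def parse_qemud_uri(uri, read_only=False):
    """Classify a qemud URI as a local socket or a remote TCP endpoint.

    Returns ("unix", socket_path) or ("tcp", host, port).
    """
    parts = urlsplit(uri)
    if parts.scheme != "qemud":
        raise ValueError("not a qemud URI: " + uri)

    if parts.hostname:
        # Remote (or explicitly addressed) daemon over IPv4 / IPv6.
        return ("tcp", parts.hostname, parts.port or DEFAULT_PORT)

    if parts.path == "/session":
        # Per-user daemon: socket lives under the user's home directory.
        home = os.path.expanduser("~")
        return ("unix", os.path.join(home, ".qemud.d", "sock"))

    if parts.path == "/system":
        # System daemon: full-control socket or the read-only one.
        return ("unix", "/var/run/qemud-ro" if read_only else "/var/run/qemud")

    raise ValueError("unknown local connection: " + parts.path)
```

For example, parse_qemud_uri("qemud://example.org/") falls back to port 8123,
while parse_qemud_uri("qemud:///system", read_only=True) picks the 'qemud-ro'
socket. The abstract-namespace detail is not modelled here.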

Server architecture
-------------------

The server is a fairly simple beast. It is single-threaded using non-blocking I/O
and poll() for all operations. It will listen on multiple sockets for incoming
connections. The protocol used for client-server comms is a very simple binary
message format close to the existing libvirt_proxy. Client sends a message, server
receives it, performs appropriate operation & sends a reply to the client. The
client (ie libvirt driver) blocks after sending its message until it gets a reply.
The server does non-blocking reads from the client buffering until it has a single
complete message, then processes it and populates the buffer with a reply and does
non-blocking writes to send it back to the client. It won't try to read a further
message from the client until it's sent the entire reply back. ie, it is a totally
synchronous message flow - no batching/pipelining of messages.  During the time
the server is processing a message it is not dealing with any other I/O, but thus
far all the operations are very fast to implement, so this isn't a serious issue,
and there are ways to deal with it if there are operations which turn out to take a
long time. I certainly want to avoid multi-threading in the server at all costs!

As well as monitoring the listening & client sockets, the poll() event loop in the
server also captures stdout & stderr from the QEMU processes. Currently we just
dump this to stdout of the daemon, but I expect we can log it somewhere. When we
start accessing the QEMU monitor there will be another fd in the event loop - ie
the pseudo-TTY (or UNIX socket) on which we talk to the monitor.
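
The "buffer until a single complete message" step can be sketched like this
in Python. Note the framing here (a 4-byte big-endian length followed by the
payload) is purely an assumption for the example, not the daemon's actual
message layout, and ClientBuffer / frame are hypothetical names.

```python
import struct

# Hypothetical framing: 4-byte big-endian payload length, then the payload.
HEADER = struct.Struct("!I")


class ClientBuffer:
    """Accumulates bytes from non-blocking reads until one message is complete."""

    def __init__(self):
        self.incoming = b""

    def feed(self, data):
        """Add bytes from one read; return the payload once complete, else None."""
        self.incoming += data
        if len(self.incoming) < HEADER.size:
            return None                      # header not fully received yet
        (length,) = HEADER.unpack_from(self.incoming)
        end = HEADER.size + length
        if len(self.incoming) < end:
            return None                      # payload not fully received yet
        payload = self.incoming[HEADER.size:end]
        self.incoming = self.incoming[end:]  # drop the consumed message
        return payload


def frame(payload):
    """Prefix a payload with its length, matching what feed() expects."""
    return HEADER.pack(len(payload)) + payload
```

Each time poll() reports a client fd readable, the server would pass whatever
the read returned to feed() and only dispatch once it gets a payload back.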

Inactive guests
---------------

Guests created using 'virsh create' (or equiv API) are treated as 'transient'
domains - ie their config files are not saved to disk. This is consistent with
the behaviour in the Xen backend. Guests created using 'virsh define', however,
are saved out to disk in $HOME/.qemud.d for the per-user session daemon. The
system-wide daemon should use /etc/qemud.d, but currently it's still /root/.qemud.d.
The config files are simply saved as the libvirt XML blob ensuring no data
conversion issues. In any case, QEMU doesn't currently have any config file 
format we can leverage. The list of inactive guests is loaded at startup of the
daemon. New config files are expected to be created via the API - files manually
created in the directory after initial startup are not seen. Might like to change
this later.
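
The startup scan amounts to something like the following Python sketch (again,
the daemon itself is C). The '*.xml' file naming and the helper name
load_inactive_guests are assumptions for the example.

```python
import glob
import os
import xml.etree.ElementTree as ET


def load_inactive_guests(config_dir):
    """Scan config_dir once and return {guest name: XML text}."""
    guests = {}
    # Assumption: one file per guest, named with a .xml suffix.
    for path in sorted(glob.glob(os.path.join(config_dir, "*.xml"))):
        with open(path, encoding="utf-8") as handle:
            xml_text = handle.read()
        try:
            name = ET.fromstring(xml_text).findtext("name")
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        if name:
            guests[name] = xml_text
    return guests
```

Because the scan happens once, a file dropped into the directory afterwards
stays invisible until the daemon restarts, which is the limitation noted above.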

XML Examples
------------

This is a guest using plain qemu, with x86_64 architecture and an ISA-only
(ie no PCI) machine emulation. I was actually running this on a 32-bit
host :-) VNC is configured to run on port 5906. QEMU can't automatically
choose a VNC port, so if one isn't specified we assign one based on the
domain ID. This should be fixed in QEMU....

<domain type='qemu'>
  <name>demo1</name>
  <uuid>4dea23b3-1d52-d8f3-2516-782e98a23fa0</uuid>
  <memory>131072</memory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='isapc'>hvm</type>
  </os>
  <devices>
    <disk type='file' device='disk'>
      <source file='/home/berrange/fedora/diskboot.img'/>
      <target dev='hda'/>
    </disk>
    <interface type='user'>
      <mac address='24:42:53:21:52:45'/>
    </interface>
    <graphics type='vnc' port='5906'/>
  </devices>
</domain>

A second example, this time using KVM acceleration. Note how I specify a
non-default path to QEMU to pick up the KVM build of QEMU. Normally the KVM
binary will default to /usr/bin/qemu-kvm - this may change depending on
how distro packaging of KVM turns out - it may even be merged into regular
QEMU binaries.

<domain type='kvm'>
  <name>demo2</name>
  <uuid>4dea24b3-1d52-d8f3-2516-782e98a23fa0</uuid>
  <memory>131072</memory>
  <vcpu>1</vcpu>
  <os>
    <type>hvm</type>
  </os>
  <devices>
    <emulator>/home/berrange/usr/kvm-devel/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='disk'>
      <source file='/home/berrange/fedora/diskboot.img'/>
      <target dev='hda'/>
    </disk>
    <interface type='user'>
      <mac address='24:42:53:21:52:45'/>
    </interface>
    <graphics type='vnc' port='-1'/>
  </devices>
</domain>

Outstanding work
----------------

  - TLS support. Need to add TLS encryption & authentication to both the client
    and server side for IPv4/6 communications. This will obviously add a dependency
    on libgnutls.so in libvirt & the daemon. I don't consider this a major problem
    since every non-trivial network app these days uses TLS. The other possible
    implementation, OpenSSL, has GPL-compatibility issues, so is not considered.

  - Change the wire format to use fixed size data types (ie, int8, int16, int32, etc)
    instead of the size-dependent int/long types. At the same time define some rules for
    the byte ordering. Client must match server ordering? Server must accept client's
    desired ordering? Everyone must use BE regardless of server/client format? I'm
    inclined to say client must match server, since it distributes the byte-swapping
    overhead to all clients and lets the common case of x86->x86 be a no-op.

  - Add a protocol version message as the first message, to let us change the protocol at will later
    while maintaining compat with older libvirt client libraries.

  - Improve support for describing the various QEMU network configurations

  - Finish boot options - boot device order & explicit kernel

  - Open & use connection to QEMU monitor which will let us implement pause/resume,
    suspend/restore drivers, and device hotplug / media changes.

  - Return sensible data for virNodeInfo - will need to have operating system dependent
    code here - parsing /proc for Linux to determine available RAM & CPU speed. Who
    knows what for Solaris / BSD ?!? Anyone know of remotely standard ways for doing
    this? Accurate host memory reporting is the only really critical data item we need.

  - There is a fair bit of duplication in various helper functions between the daemon,
    and various libvirt driver backends. We should probably pull this stuff out into
    a separate lib/ directory, build it into a static library and then link that into
    libvirt, virsh & the qemud daemon as needed.
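
To illustrate the wire format item above, here is a small Python sketch of a
header built from fixed-size types where the client packs in the server's byte
order. The header layout (uint16 version, uint16 message type, uint32 payload
length) is invented for the example, and how the client learns the server's
ordering is left open.

```python
import struct

# struct prefixes: "<" is little-endian, ">" is big-endian. Both use standard
# sizes with no padding (H = 16 bits, I = 32 bits), unlike native int/long.
PREFIX = {"little": "<", "big": ">"}


def pack_header(server_order, version, msg_type, payload_len):
    """Encode a hypothetical header in the server's byte order."""
    return struct.pack(PREFIX[server_order] + "HHI", version, msg_type, payload_len)


def unpack_header(server_order, data):
    """Decode a header; returns (version, msg_type, payload_len)."""
    return struct.unpack(PREFIX[server_order] + "HHI", data)
```

With "client must match server", an x86 client talking to an x86 server packs
with "little" and no swapping happens on either side; only a client whose
native order differs pays the conversion cost.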

Dan.
-- 
|=- Red Hat, Engineering, Emerging Technologies, Boston.  +1 978 392 2496 -=|
|=-           Perl modules: http://search.cpan.org/~danberr/              -=|
|=-               Projects: http://freshmeat.net/~danielpb/               -=|
|=-  GnuPG: 7D3B9505   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505  -=|