[libvirt] [PATCH 00/10] network: physical device abstraction aka 'virtual switch'

Tue Jul 5 07:45:48 UTC 2011

This patch is in response to the following bug reports:

 https://bugzilla.redhat.com/show_bug.cgi?id=643947 (RHEL)
 https://bugzilla.redhat.com/show_bug.cgi?id=636106 (upstream)

It is functionally complete, and has gone through rudimentary testing
for bridge networks (host bridge) and direct networks in bridge mode
(macvtap). The patch series doesn't yet include updates to the domain
and network XML documentation, though, so it isn't ready to push.

I am sending it now to get feedback, both on the specifics of the
code, as well as on how it is designed and how it works. I will be
transferring info from my design document (at the end of this message)
into the libvirt doc files during this week, and will have them ready
for the V2 of the series that is sure to be requested.

*****************
(The working design document)

Network device abstraction aka virtual switch - V4
==================================================

The <interface> element of a guest's domain config in libvirt has a
<source> element that describes what resources on a host will be used
to connect the guest's network interface to the rest of the
world. This is very flexible, allowing several different types of
connection (virtual network, host bridge, direct macvtap connection to
physical interface, qemu usermode, user-defined via an external
script), but currently has the problem that unnecessary details of the
host resources are embedded into the guest's config; if the guest is
migrated to a different host, and that host has a different hardware
or network config (or possibly the same hardware, but that hardware is
currently in use by a different guest), the migration will fail.

This document outlines a change to libvirt's network XML that will
allow us to (optionally - old configs will remain valid) remove the
host details from the guest's domain XML (which can move around from
host to host) and place them in the network XML (which remains with a
single host); the domain XML will then use existing config elements to
associate each guest interface with a "network".

The motivating use case for this change is the "direct" connection
type (which uses macvtap for vepa and vnlink connections directly
between a guest and a physical interface, rather than through a
bridge), but it is applicable to all types of connection. (Another
hoped-for side effect of this change will be to make libvirt's network
connection model easier to realize on non-Linux hypervisors (eg,
VMware ESX) and for other network technologies, such as openvswitch,
VDE, and various VPN implementations.)

Background
==========

(parts lifted from Dan Berrange's mail on this subject)

Currently <network> supports 3 connectivity modes:

 - Non-routed network, separate subnet        (no <forward> element present)
 - Routed network, separate subnet with NAT   (<forward mode='nat'/>)
 - Routed network, separate subnet            (<forward mode='route'/>)

Each of these is implemented in the existing network driver by
creating a bridge device using brctl, and connecting the guest network
interfaces via tap devices (a detail which, now that I've stated it,
you should promptly forget!). All traffic between that bridge and the
outside network is done via the host's IP routing stack (ie, there is
no physical device directly connected to the bridge).

In the future, these two additional routed modes might be useful:

 - Routed network, IP subnetting
 - Routed network, separate subnet with VPN

The core goal of this proposal, though, is to replace type=bridge and
type=direct from the domain interface XML with new types of <network>
definitions so that the domain can just give "type='network'" and have
all the necessary details filled in at runtime. This basically means
we're adding several bridging modes (the submodes of "direct" have
been flattened out here):

 - Bridged network, eth + bridge + tap
 - Bridged network, eth + macvtap + vepa
 - Bridged network, eth + macvtap + private
 - Bridged network, eth + macvtap + passthrough
 - Bridged network, eth + macvtap + bridge

Another "future expansion" could be to add:

 - Bridged network, with VPN

Likewise, support for other technologies, such as openvswitch and VDE
would each be another entry on this list.

(Dan also listed each of the above "+sriov" separately, but that ends
up being handled in an orthogonal manner (by just specifying a pool of
interfaces for a single network), so I'm only giving the abbreviated
list)

I. Changes to domain <interface> element
========================================

In many cases, the <interface> element of the domain XML will be
identical to what is used now when connecting the interface to a
libvirt-style virtual network:

  <interface type='network'>
    <source network='red-network'/>
    <mac address='xx:xx:xx:xx:xx:xx'/>
  </interface>

Depending on the definition of the network "red-network" on the host
the guest was started on / migrated to, this could be either a direct
(macvtap) connection using one of the various direct modes
(vepa/private/bridge/passthrough), a bridge (again, pointed to by the
definition of 'red-network'), or a virtual network (using the current
network definition syntax). This way the same guest could be migrated
not only between macvtap-enabled hosts, but from there to a host using
a bridge, or maybe a host in a remote location that used a virtual
network with a secure tunnel to connect back to the rest of the
red-network.

 (Part of the migration process would of course check that the
destination host had a network of the proper name with adequate
available resources, and fail if it didn't; management software at a
level above libvirt would probably filter a list of candidate
migration destinations based on available networks and any various
details of those networks (eg. it could search for only networks using
vepa for the connection), and only attempt migration to one that had
the matching network available).

<virtualport> element of <interface>
------------------------------------

Since many of the attributes/sub-elements of <virtualport> (used by
some modes of "direct" interface connections) are identical for all
interfaces connecting to any given switch, most of the information in
<virtualport> will be optional in the domain's interface definition -
it can be filled in from a similar <virtualport> element that will be
added to the <network> definition.

Some parameters in <virtualport> ("instanceid", for example) must be
unique for every interface, though, so those will still be specified
in the <interface> XML. The two <virtualport> elements will be OR'ed
at runtime to arrive at the actual set of parameters that are
used.

(Open Question: What should be the policy when a parameter is
specified in both places? Should one take precedence? Or should it be
considered an error?)
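
As a rough illustration of the merge described above, here is a
minimal Python sketch. The function name merge_virtualport and its
on_conflict argument are hypothetical and are not part of libvirt.
Because the policy for a parameter given in both places is still an
open question, the sketch leaves that choice to the caller rather than
picking an answer.

```python
def merge_virtualport(network_params, interface_params, on_conflict="error"):
    """Combine <virtualport> parameters from the network and the interface.

    on_conflict is a hypothetical knob: "error", "interface" or "network".
    """
    if on_conflict not in ("error", "interface", "network"):
        raise ValueError(f"unknown conflict policy {on_conflict!r}")
    merged = dict(network_params)
    for key, value in interface_params.items():
        if key in merged and merged[key] != value:
            if on_conflict == "error":
                raise ValueError(f"parameter {key!r} is set in both places")
            if on_conflict == "network":
                continue  # keep the value from the network definition
        merged[key] = value
    return merged


# Parameters shared by the whole network, plus the per-interface instanceid.
network = {"managerid": "11", "typeid": "1193047", "typeidversion": "2"}
interface = {"instanceid": "09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f"}
merged = merge_virtualport(network, interface)
```

With no overlapping keys, as in this usage, the result is simply the
union of the two parameter sets, whichever policy is chosen.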

portgroup attribute of <source>
-------------------------------

The <source> element of an interface definition will be able to
optionally specify a "portgroup" attribute. If portgroup is *NOT*
given, the default portgroup of the network will be used (if a default
is defined, otherwise no portgroup will be used). If portgroup *IS*
specified, the source network must have a portgroup by that name (or
the domain startup/migration will fail), and the attributes of that
portgroup will be used for the connection. Here is an example
<interface> definition that has both a reduced <virtualport> element,
as well as a portgroup attribute:

  <interface type='network'>
    <source network='red-network' portgroup='engineering'/>
    <virtualport type="802.1Qbg">
      <parameters managerid="11" typeid="1193047" typeidversion="2"
                  instanceid="09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f"/>
    </virtualport>
    <mac address='de:ad:be:ef:ca:fe'/>
  </interface>

(The specifics of what can be in a portgroup are given below)

storing the actual chosen/running config in state dir
-----------------------------------------------------

Note: the following additions to the XML will only ever be visible in
the statedir copy of the domain config, which is used to keep track of
the state of running domains in case libvirtd is restarted while
domains are still running. The described element cannot be used in a
user-generated config file, and will never be present in a domain
interface config produced by the libvirt public API, nor by "virsh
dumpxml".

In order to remind libvirt about which interfaces are actually in use
in the event that libvirtd is restarted while domains are still
running, the copy of the domain XML stored in "statedir"
(/var/lib/libvirt/qemu/*.xml) may have an extra element <actual>
stored as a subelement of each <interface>:

  <interface type='network'>
    <source network='red-network' portgroup='engineering'/>
    <virtualport type="802.1Qbg">
      <parameters managerid="11" typeid="1193047" typeidversion="2"
                  instanceid="09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f"/>
    </virtualport>
    <mac address='de:ad:be:ef:ca:fe'/>
    <actual type='direct'>
      <source dev='eth1' mode='vepa'/>
    </actual>
  </interface>

In short, merging the <actual> element up into <interface> will yield
the full interface as it was actually instantiated for the domain. In
this case, the interface still has a mac address of de:ad:be:ef:ca:fe,
and will use the same <virtualport> parameters, but the actual type of
the interface will be 'direct' (so macvtap will be used to connect the
interface), and the connection will be via physical device eth1 in
vepa mode.
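
The "merging up" can be pictured with the following Python sketch. It
is only an illustration under an assumed rule (the type attribute and
subelements of <actual> replace the same-named ones in <interface>);
the helper name effective_interface is hypothetical, and the patches
themselves may do this differently.

```python
import copy
import xml.etree.ElementTree as ET


def effective_interface(interface):
    """Return a copy of <interface> with <actual> merged up into it."""
    result = copy.deepcopy(interface)
    actual = result.find("actual")
    if actual is None:
        return result  # nothing was recorded, the config is used as-is
    result.remove(actual)
    if actual.get("type") is not None:
        result.set("type", actual.get("type"))
    # Assumed rule: a subelement of <actual> replaces same-named ones.
    replaced = {child.tag for child in actual}
    for old in [elem for elem in result if elem.tag in replaced]:
        result.remove(old)
    result.extend(list(actual))
    return result


STATE_XML = """
<interface type='network'>
  <source network='red-network' portgroup='engineering'/>
  <mac address='de:ad:be:ef:ca:fe'/>
  <actual type='direct'>
    <source dev='eth1' mode='vepa'/>
  </actual>
</interface>
"""
iface = effective_interface(ET.fromstring(STATE_XML))
```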

II. Changes to <network> definition
===================================

As Dan has pointed out, any additions to <network> must be designed so
that existing management applications (written to understand <network>
prior to these new additions) will at least recognize that the XML
they've been given is for something new that they don't fully
understand. At the same time, the new types of network definition
should attempt to re-use as much of the existing elements/attributes
as possible, both to make it easier to extend these applications, as
well as to make the status displays of un-updated applications make as
much sense as possible.

The new types of network will be specified by extending the choices
for <forward mode='....'>.

The current modes are:

   <forward mode='route|nat'/>

(in addition to not listing any mode, which equates to "isolated")

Here are suggested new modes:

   <forward mode='bridge|vepa|private|passthrough'/>

A description of each:

bridge       - equivalent to "<interface type='bridge'>" in the
               interface definition. The bridge device to use would be
               given in the existing <bridge name='xxx'>.

                 or

               <interface type='direct'> ... <source mode='bridge'/>
               (ie, macvtap bridge mode) with the physical interface
               name given in <forward dev='xxx'> or from the pool of
               devices given as subelements of <forward> (see below)

vepa         - same as "<interface type='direct'>..." with <source
               mode='vepa'/>

private      - <interface type='direct'> ... <source mode='private'/>

passthrough  - <interface type='direct'> ... <source mode='passthrough'/>

Interface Pools
---------------

In many cases, a single host network may have multiple physical
network devices associated with it (especially in the case of an
SRIOV-capable ethernet card, which will have several "virtual
functions" associated with a single physical ethernet connection). The
host will at least want to balance the load of multiple guests between
these multiple devices, and may even require (in the case of
passthrough mode, for example) that only a single guest interface be
attached to each host device.

The current specification for <forward> only allows for a single "dev"
attribute, though. In order to support multiple device names, we will
extend <forward> to allow 0 or more <interface> subelements:

  <forward mode='vepa'>
    <interface dev='eth10'/>
    <interface dev='eth11'/>
    <interface dev='eth12'/>
    <interface dev='eth13'/>
  </forward>

Note that, as a convenience, *on output* the first of these elements
will always be a duplicate of the "dev" attribute in <forward>
itself. When sending XML definitions to libvirt, either a single
interface should be given in the "dev" attribute of <forward>, or a
pool of them as subelements, but not both (if both are given and the
first in the pool matches the one given in <forward>, the duplicate
will be ignored, but if they don't match, that is an error).
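
The input rule above could be checked along these lines. This is a
sketch only; the function name forward_devices is hypothetical, and it
works on a standalone <forward> fragment rather than a full <network>
definition.

```python
import xml.etree.ElementTree as ET


def forward_devices(forward_xml):
    """Return the list of device names described by a <forward> element."""
    forward = ET.fromstring(forward_xml)
    dev = forward.get("dev")
    pool = [iface.get("dev") for iface in forward.findall("interface")]
    if dev is None:
        return pool          # only a pool (possibly empty) was given
    if not pool:
        return [dev]         # only the single "dev" attribute was given
    if pool[0] != dev:
        raise ValueError(
            f"forward dev {dev!r} does not match first pool entry {pool[0]!r}"
        )
    return pool              # matching duplicate is ignored


devices = forward_devices(
    "<forward mode='vepa' dev='eth10'>"
    "<interface dev='eth10'/><interface dev='eth11'/>"
    "</forward>"
)
```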

In the case of mode='passthrough' (as well as mode='private' if the
<virtualport> has a type of 802.1Qbh), only one guest
interface can be connected to a device at a time. libvirt will keep
track of which devices are in use, and attempt to assign a free
device; failure to assign a device will result in a failure of the
domain to start/migrate. For the other direct modes, libvirt will
simply keep track of the number of guest interfaces currently using
each device, and attempt to keep them balanced.
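
A minimal Python sketch of this allocation policy follows. The class
InterfacePool and its method names are hypothetical, and "balanced" is
assumed here to mean "pick the device with the fewest guest interfaces
attached"; the real implementation may balance differently.

```python
class InterfacePool:
    """Tracks how many guest interfaces use each physical device."""

    def __init__(self, devices):
        self.usage = {dev: 0 for dev in devices}

    def allocate(self, exclusive=False):
        """Pick the least-used device; exclusive modes need a free one."""
        if not self.usage:
            raise RuntimeError("network has no devices in its pool")
        dev = min(self.usage, key=self.usage.get)
        if exclusive and self.usage[dev] > 0:
            # eg. passthrough: the domain start/migration must fail
            raise RuntimeError("no free device available in pool")
        self.usage[dev] += 1
        return dev

    def release(self, dev):
        """Drop one guest interface from a device's usage count."""
        if self.usage[dev] > 0:
            self.usage[dev] -= 1


shared = InterfacePool(["eth10", "eth11"])
picks = [shared.allocate() for _ in range(3)]
```

In the shared case a third guest interface simply lands back on the
first device, while the same request with exclusive=True on a pool of
two devices would fail once both are taken.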

Portgroups
----------

A <portgroup> (subelement of <network>) is just a way of easily
putting connections to the network into different classes, with each
class having a different level/type of service. Each <network> can
have multiple <portgroup> elements, and each <portgroup> has a name,
as well as various attributes associated with it. If an interface
definition specifies a portgroup, that portgroup's info will be used
to modify the interface's setup. If no portgroup is given and one of
the network's portgroups has "default='yes'", that default portgroup
will be used. If no portgroup is given in the interface definition,
and there is no default portgroup, then none will be used.
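
The selection rules in the paragraph above can be sketched in Python
as follows. The function name select_portgroup and the dict-based
representation of a portgroup are hypothetical, chosen only to keep
the example short.

```python
def select_portgroup(portgroups, requested=None):
    """Choose the portgroup for an interface, or None if none applies.

    portgroups is a list of dicts with a 'name' key and an optional
    'default' key, mirroring the attributes of <portgroup>.
    """
    if requested is not None:
        for portgroup in portgroups:
            if portgroup["name"] == requested:
                return portgroup
        # A named portgroup that doesn't exist fails startup/migration.
        raise LookupError(f"network has no portgroup named {requested!r}")
    for portgroup in portgroups:
        if portgroup.get("default") == "yes":
            return portgroup
    return None


portgroups = [
    {"name": "accounting"},
    {"name": "engineering", "default": "yes"},
]
chosen = select_portgroup(portgroups)
```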

The first thing we will use portgroups for is as an alternate place to
specify <virtualport> parameters:

  <portgroup name='engineering' default='yes'>
    <virtualport type="802.1Qbg">
      <parameters managerid="11" typeid="1193047" typeidversion="2"/>
    </virtualport>
  </portgroup>

Anything that is valid in an interface's <virtualport> is also valid here.

The next thing to specify in a portgroup will be bandwidth limiting /
QoS configuration. Since I don't know exactly what's needed for that,
I won't specify it here.

If anything is specified both directly under <network> and in a
<portgroup>, the value in portgroup will take precedence. (Again -
what will the precedence of items specified in the <interface> be?)

EXAMPLES
--------

Examples of 'red-network' for different types of connections. All of
these would work with minor variations of the interface XML given
above; eg, the 'vepa' version would require a <virtualport> in the
interface that specified an instanceid, and if the <interface>
specified a portgroup, that portgroup would also need to be in the
<network> definition (even if it was empty aside from its name).

  <!-- Existing usage - a libvirt virtual network -->
  <network>
    <name>red-network</name>
    <bridge name='virbr0'/>
    <forward mode='route'/>
        ...
  </network>

  <!-- The simplest - an existing host bridge -->
  <network>
    <name>red-network</name>
    <forward mode='bridge'/>
    <bridge name='eth0'/>
  </network>

  <!-- A macvtap connection to a vepa bridge -->
  <network>
    <name>red-network</name>
    <forward mode='vepa' dev='eth10'/>
    <virtualport type='802.1Qbg'>
      <parameters managerid='11' typeid='1193047' typeidversion='2'/>
    </virtualport>
    <!-- NB: if <interface> doesn't specify portgroup, -->
    <!-- 'accounting' is assumed -->
    <portgroup name='accounting' default='yes'>
      <virtualport>
        <parameters typeid='22'/>
      </virtualport>
    </portgroup>
    <portgroup name='engineering'>
      <virtualport>
        <parameters typeid='33'/>
      </virtualport>
    </portgroup>
  </network>

  <!-- A macvtap passthrough connection (one guest interface per dev) -->
  <network>
    <name>red-network</name>
    <forward mode='passthrough'>
      <interface dev='eth10'/>
      <interface dev='eth11'/>
      <interface dev='eth12'/>
      <interface dev='eth13'/>
      <interface dev='eth14'/>
      <interface dev='eth15'/>
      <interface dev='eth16'/>
      <interface dev='eth17'/>
    </forward>
  </network>

=============

Keeping Track of Interface Usage by Guests
==========================================

While libvirtd is running, each physical interface in a network's pool
will maintain a count of how many guest interfaces are using that
physical interface. Each guest interface will also maintain
information about which network, and which physical interface on that
network, it is using. The following situations could occur:

1) A guest is terminated while libvirtd is running.

   libvirtd will notice this, and decrement the usage count for each
   interface used by the guest, as maintained in the guest's state
   info.

2) The host system is rebooted

   When the libvirt network driver is restarted, no guests will yet be
   running, so the usage count of each physical interface will be 0,
   and get incremented as guests are started up.

3) libvirtd is restarted

   When the network driver is restarted, the usage count for all physical
   interfaces will be set to 0, just as if the entire system had
   been rebooted. One of two situations might be encountered:

   3a) The guest is still running when libvirtd is restarted. In this
       case, the existing state information of the guest will be examined
       to determine which physical interface usage count to increment.

   3b) The guest has been terminated while libvirtd wasn't present. Since
       the guest is no longer running, its state information will be thrown
       away.
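
The restart handling in case 3 can be sketched in Python as below. The
function name rebuild_usage and the dict layout used for a guest's
saved state are hypothetical; the sketch only shows the counting logic
(reset to 0, re-count running guests, discard state of dead ones).

```python
def rebuild_usage(pool_devices, guest_states):
    """Recompute per-device usage counts after libvirtd restarts.

    guest_states is a list of dicts with 'name', 'running' and
    'devices' (the physical devices recorded in the guest's state).
    Returns the usage counts and the names of guests whose state is kept.
    """
    usage = {dev: 0 for dev in pool_devices}  # as if freshly booted
    kept = []
    for guest in guest_states:
        if not guest["running"]:
            continue  # case 3b: state of a dead guest is thrown away
        kept.append(guest["name"])
        for dev in guest["devices"]:  # case 3a: re-count what is in use
            if dev in usage:
                usage[dev] += 1
    return usage, kept


usage, kept = rebuild_usage(
    ["eth10", "eth11"],
    [
        {"name": "guest1", "running": True, "devices": ["eth10"]},
        {"name": "guest2", "running": False, "devices": ["eth11"]},
        {"name": "guest3", "running": True, "devices": ["eth10"]},
    ],
)
```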