
Re: [libvirt] Supporting vhost-net and macvtap in libvirt for QEMU

On 01/21/2010 03:13 PM, Vivek Kashyap wrote:

So I think we want to maintain a concept of the qemu backend (virtio,
e1000, etc), the fd that connects the qemu backend to the host (tap,
socket, macvtap, etc), and the bridge.  The bridge bit gets a little
complicated.  We have the following bridge cases:

- sw bridge
 - normal existing setup, w/ Linux bridging code
 - macvlan
- hw bridge
 - on SR-IOV card
   - configured to simply fwd to external hw bridge (like VEPA mode)
   - configured as a bridge w/ policies (QoS, ACL, port mirroring,
     etc.); allows inter-guest traffic and looks a bit like the above
     sw bridge
 - external
   - need to possibly inform switch of incoming vport

I've got mixed feelings here.  With respect to sw vs. hw bridge, I
really think that that's an implementation detail that should not be
exposed to a user. A user doesn't typically want to think about whether
they're using a hardware switch vs. software switch.  Instead, they
approach it from, I want to have this network topology, and these
features enabled.

Agree there is a lot of low level detail there, and I think it will be
very hard for users or apps to gain enough knowledge to make intelligent
decisions about which they should use. So I don't think we want to expose
all that detail. For a libvirt representation we need to consider it more
in terms of what capabilities each option provides, rather than what
implementation each option uses.

Attached is some background information on VEPA bridging being discussed in
this thread and then a proposal for defining it in libvirt xml.

The 'Edge Virtual Bridging' (eVB) working group has proposed a mechanism to
offload the bridging function from the server to a physical switch on
the network. This is referred to as VEPA (Virtual Ethernet Port
Aggregator). This is described here:


The VEPA mode implies that the virtual machines on a host communicate with
each other via the physical switch on the network instead of the bridge in
the Linux host. The filtering, quality of service enforcement, stats etc.
are all done in the external switch.

The newer NICs with embedded switches (such as SR-IOV cards) will
also provide VEPA mode. This implies that the communication between two
virtual functions on the same physical NIC will also require a packet to
travel to the first hop switch on the network and then be reflected back.

The 'macvlan' driver in Linux supports virtual interfaces that can be
attached to virtual machine interfaces. This patch provides a tap backend
for macvlan: http://marc.info/?l=linux-kernel&m=125986323631311&w=2. If
such an interface is used, the packets will be forwarded directly onto the
network, bypassing the host bridge. This is exactly what is required for
VEPA mode.

However, the 'macvlan' driver can support both VEPA and 'bridging' mode.
The bridging in this case is among its virtual interfaces only. There is
also a private mode in which the packets are transmitted to the network
but are not forwarded among the VMs.
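
As a rough illustration of the three macvlan modes just described, the
following Python sketch builds the iproute2 command line that would create
a macvtap interface on a given uplink. The helper name macvtap_add_cmd and
the interface names are hypothetical, and the sketch assumes a kernel and
iproute2 that include macvtap support. The command is only constructed,
not executed.

```python
# Mode keywords accepted by iproute2 for macvlan/macvtap links that
# correspond to the three modes discussed in this thread.
MACVLAN_MODES = ("vepa", "bridge", "private")


def macvtap_add_cmd(uplink, name, mode):
    """Return the argv that creates macvtap device `name` on top of `uplink`.

    `macvtap_add_cmd` is an illustrative helper, not part of libvirt.
    """
    mode = mode.lower()
    if mode not in MACVLAN_MODES:
        raise ValueError("unsupported macvlan mode: %r" % mode)
    return ["ip", "link", "add", "link", uplink, "name", name,
            "type", "macvtap", "mode", mode]


if __name__ == "__main__":
    # Show the command for each mode; running it would need root privileges.
    for m in MACVLAN_MODES:
        print(" ".join(macvtap_add_cmd("eth0", "macvtap0", m)))
```

Passing the returned list to subprocess.run() as root would create the
device; a management layer would then open the matching /dev/tapN node and
hand the fd to qemu.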

Similarly, the embedded switch on sr-iov cards will in the future be
settable to 'VEPA', 'private' or 'bridging' mode.

In the eVB working group the 'private' mode is referred to as PEPA, and
the 'bridging' as VEB (Virtual ethernet bridge). I'll use the same

The 'VEB' mode of macvlan or sr-iov is no different than the bridge
in Linux. The behaviour of the networking/switching on the network is

Changes in the first-hop adjacent Switch on the network:
When the 'VEPA' (or PEPA) mode is used, the packet switching occurs on the
first hop switch. Therefore, for VM to VM traffic, the first hop switch must
support reflecting the packets back on the port on which they were
received. This is referred to as the 'hairpin' or 'reflective relay' mode.
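
To make the forwarding behaviour concrete, here is a minimal Python model
of whether a frame sent by one VM reaches another VM behind the same
uplink. It is a simplification based only on the descriptions in this
mail; the function name reaches_local_vm and its arguments are
hypothetical and do not correspond to any libvirt or kernel interface.

```python
def reaches_local_vm(mode, switch_hairpin):
    """Return True if VM-to-VM traffic on the same uplink is delivered.

    mode           -- 'VEB', 'VEPA' or 'PEPA' (eVB terminology)
    switch_hairpin -- whether the first hop switch reflects frames back
                      on the port they arrived on
    """
    mode = mode.upper()
    if mode == "VEB":
        # Switched locally (Linux bridge, macvlan bridge mode, or the
        # NIC's embedded switch); the external switch is not involved.
        return True
    if mode == "VEPA":
        # The frame always leaves the host; it only comes back if the
        # adjacent switch supports reflective relay.
        return bool(switch_hairpin)
    if mode == "PEPA":
        # Frames go out to the network but are not forwarded among VMs.
        return False
    raise ValueError("unknown mode: %r" % mode)


if __name__ == "__main__":
    for mode in ("VEB", "VEPA", "PEPA"):
        for hairpin in (False, True):
            print(mode, hairpin, reaches_local_vm(mode, hairpin))
```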

The IEEE 802.1 body is standardizing the protocol, with the switch
vendors and various server vendors working on the standard. This
is derived from the above mentioned eVB ('edge virtual bridging')
working group.

To enable easy testing, the Linux bridge can be put into the 'reflective
relay' (or hairpin) mode. The patches are included in 2.6.32. The mode can
be set using sysfs or brctl commands (in the latest bridge utils bits).
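
For reference, a small Python sketch of the two ways of setting hairpin
mode on a Linux bridge port mentioned above. The sysfs layout
(/sys/class/net/<bridge>/brif/<port>/hairpin_mode) and the brctl
'hairpin' subcommand are assumed to be present, which depends on the
kernel and bridge-utils versions. The helper names and the br0/eth0
names are illustrative only.

```python
import os


def hairpin_sysfs_path(bridge, port, sysfs_root="/sys"):
    """Path of the per-port hairpin_mode attribute for `port` on `bridge`."""
    return os.path.join(sysfs_root, "class", "net", bridge,
                        "brif", port, "hairpin_mode")


def set_hairpin(bridge, port, enabled, sysfs_root="/sys"):
    """Write 1 or 0 to the port's hairpin_mode attribute (needs root)."""
    with open(hairpin_sysfs_path(bridge, port, sysfs_root), "w") as f:
        f.write("1" if enabled else "0")


def brctl_hairpin_cmd(bridge, port, enabled):
    """Equivalent brctl argv, for bridge-utils builds with hairpin support."""
    return ["brctl", "hairpin", bridge, port, "on" if enabled else "off"]


if __name__ == "__main__":
    print(hairpin_sysfs_path("br0", "eth0"))
    print(" ".join(brctl_hairpin_cmd("br0", "eth0", True)))
```

The sysfs_root parameter exists only so the write can be exercised
against a scratch directory instead of the live /sys tree.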

In the future the switch vendors (in eVB group) expect to support both
VEPA and VEB on the same switch port. That is, the Linux host can have
some VMs using VEPA mode and some in VEB mode on the same outgoing
uplink. This protocol is to be fully defined and will require more
changes in the bridging function. The ethernet frame will carry tags to
identify the packet streams (for VEPA or VEB ports). See chart 4 in the
above linked IEEE document.

However, from a libvirt definition point of view it implies that a
'bridge' can be in multiple modes (VEPA or VEB or PEPA). An alternative is
to define separate bridges handling VEB/VEPA or PEPA modes for the same
'sr-iov' or 'macvlan' backend.

Determining the switch capability:
The Linux host can determine whether the remote bridge supports 'hairpin'
mode, and also set this capability, through a low level protocol (DCBx)
being extended in the above eVB working group.
Some drivers (for NICs/CNAs) are likely to do this determination
themselves and make the information available to the hypervisor/Linux


Based on above a virtual machine might be defined to work with the
Linux/hypervisor bridge, with the 'macvlan' in bridge, vepa/pepa modes,
or with sr-iov virtual function with switching in bridge, or vepa/pepa


To support the above combinations we need to be able to define the bridge
to be used, the 'uplink' it is associated with, and the interface type
that the VM will use to connect to the bridge.

Currently in libvirt we define a bridge and can associate an ethernet
with it (which is the uplink to the network). In the 'macvlan' and the
'sr-iov' cases there is no creation of the bridge itself. In 'sr-iov' it
is embedded in the 'nic', and in the case of macvlan the function is
enabled when the virtual interface is created.

Describing the bridge and modes:
So, we can define the bridge function using a new type or maybe extend
the bridge.xml itself.

<interface type='bridge' name='br0'>
<type='hypervisor|embedded|ethernet'/> //hypervisor is default
<mode='all|VEPA|PEPA|VEB'/>          // 'all' is default if supported.
<interface type='ethernet' name='eth0'/>

Does this really map to how VEPA works?

For a physical bridge, you create a br0 network interface that also has eth0 as a component.

With VEPA and macv{lan,tap}, you do not create a single "br0" interface. Instead, for the given physical port, you create interfaces for each tap device and hand them over. IOW, while something like:

<interface type='bridge' name='br0'>
  <interface type='ethernet' name='eth0'/>
  <interface type='ethernet' name='eth1'/>
</interface>

Makes sense, the following wouldn't:

<interface type='bridge' name='br0'>
  <bridge mode='VEPA'/>
  <interface type='ethernet' name='eth0'/>
  <interface type='ethernet' name='eth1'/>
</interface>

I think the only use of the interface tag that would make sense is:

<interface type='ethernet' name='eth0'>

And then in the VM definition, instead of:

<interface type='direct'>
<source physical='eth0'>

You can imagine doing something similar with SR-IOV:

<interface type='ethernet' name='eth0'>

and in the guest:

<interface type='direct'>
<source physical='eth0'>

The 'type' and 'mode' need not be specified. libvirt could default to the
virtual bridge in the hypervisor. Similarly, the supported modes may be
determined dynamically by libvirt.

Or, we could invent a new type for macvlan or sr-iov based switching:

<interface type ='physical' name='pbr0'/>
<source dev='eth0'/>
<type='embedded|ethernet'/> <mode='all|VEPA|PEPA|VEB'/> // all is default if supported.

IIUC, when you do macvlan/macvtap, there is no 'pbr0' interface. It's fundamentally different than standard bridging and I think ought to be treated differently.


Anthony Liguori
