[libvirt] Supporting vhost-net and macvtap in libvirt for QEMU

Vivek Kashyap kashyapv at us.ibm.com
Thu Jan 21 21:13:53 UTC 2010


.....
>
>>> So I think we want to maintain a concept of the qemu backend (virtio,
>>> e1000, etc), the fd that connects the qemu backend to the host (tap,
>>> socket, macvtap, etc), and the bridge.  The bridge bit gets a little
>>> complicated.  We have the following bridge cases:
>>>
>>> - sw bridge
>>>  - normal existing setup, w/ Linux bridging code
>>>  - macvlan
>>> - hw bridge
>>>  - on SR-IOV card
>>>    - configured to simply fwd to external hw bridge (like VEPA mode)
>>>    - configured as a bridge w/ policies (QoS, ACL, port mirroring,
>>>      etc. and allows inter-guest traffic and looks a bit like above
>>>      sw switch)
>>>  - external
>>>    - need to possibly inform switch of incoming vport
>>
>> I've got mixed feelings here.  With respect to sw vs. hw bridge, I
>> really think that that's an implementation detail that should not be
>> exposed to a user.  A user doesn't typically want to think about whether
>> they're using a hardware switch vs. software switch.  Instead, they
>> approach it from, I want to have this network topology, and these
>> features enabled.
>
> Agree there is a lot of low level detail there, and I think it will be
> very hard for users, or apps, to gain enough knowledge to make intelligent
> decisions about which they should use. So I don't think we want to expose
> all that detail. For a libvirt representation we need to consider it more
> in terms of what capabilities each option provides, rather than what
> implementation each option uses.
>

Below is some background information on the VEPA bridging being discussed
in this thread, followed by a proposal for describing it in libvirt XML.

The 'Edge Virtual Bridging' (eVB) working group has proposed a mechanism to
offload the bridging function from the server to a physical switch on
the network. This is referred to as VEPA (Virtual Ethernet Port
Aggregator) and is described here:

http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-0709-v01.pdf

VEPA mode implies that the virtual machines on a host communicate with
each other via the physical switch on the network instead of the bridge in
the Linux host. Filtering, quality-of-service enforcement, statistics
collection, etc. are all done in the external switch.

Newer NICs with embedded switches (such as SR-IOV cards) will also
provide a VEPA mode. This implies that communication between two virtual
functions on the same physical NIC also requires a packet to travel to
the first-hop switch on the network and be reflected back.

The 'macvlan' driver in Linux provides virtual interfaces that can be
attached to virtual machine interfaces. This patch adds a tap backend
('macvtap') to macvlan: http://marc.info/?l=linux-kernel&m=125986323631311&w=2
If such an interface is used, packets are forwarded directly onto the
network, bypassing the host bridge. This is exactly what is required for
VEPA mode.
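
As a concrete example, with those patches applied and a recent enough
iproute2, such an interface can be created by hand along these lines
(interface names here are only placeholders):

  # create a macvtap interface on top of the physical uplink, in VEPA mode
  ip link add link eth0 name macvtap0 type macvtap mode vepa
  ip link set macvtap0 up

  # the tap endpoint appears as /dev/tapN, where N is the ifindex of
  # macvtap0; this is the device a VM's network backend would open
  ls -l /dev/tap$(cat /sys/class/net/macvtap0/ifindex)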

However, the 'macvlan' driver can support both a VEPA and a 'bridging'
mode. Bridging in this case is among its virtual interfaces only. There
is also a 'private' mode in which packets are transmitted to the network
but are not forwarded among the VMs.

Similarly, the embedded switch on SR-IOV cards will in the future be
settable to 'VEPA', 'private', or 'bridging' mode.

In the eVB working group the 'private' mode is referred to as PEPA, and
the 'bridging' mode as VEB (Virtual Ethernet Bridge). I'll use the same
terms below.
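
In macvlan terms the three modes map directly onto this terminology;
with iproute2 the mode is selected when the interface is created, for
example (interface names are illustrative):

  # VEB: bridge locally between the macvlan interfaces on eth0
  ip link add link eth0 name veb0 type macvlan mode bridge

  # VEPA: always send to the wire and rely on the adjacent switch to
  # reflect VM-to-VM traffic back (hairpin)
  ip link add link eth0 name vepa0 type macvlan mode vepa

  # PEPA/private: send to the wire, but never deliver frames back to the
  # other local macvlan interfaces
  ip link add link eth0 name pepa0 type macvlan mode private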

The 'VEB' mode of macvlan or SR-IOV is no different from the bridge in
Linux: the behaviour of the networking/switching on the external network
is unaffected.

Changes in the adjacent (first-hop) switch on the network:
---------------------------------------------------------
When 'VEPA' (or PEPA) mode is used, packet switching occurs on the
first-hop switch. Therefore, for VM-to-VM traffic, the first-hop switch
must support reflecting packets back out on the port on which they were
received. This is referred to as 'hairpin' or 'reflective relay' mode.

The IEEE 802.1 body is standardizing the protocol, with the switch
vendors and various other server vendors working on the standard. This
work derives from the above-mentioned eVB ('edge virtual bridging')
working group.

To enable easy testing, the Linux bridge can be put into 'reflective
relay' (hairpin) mode. The patches are included in 2.6.32. The mode can
be set via sysfs or with brctl (in the latest bridge-utils bits).
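
For example, assuming a bridge 'br0' with port 'eth0', the per-port
hairpin flag can be toggled like this (the brctl form needs a
bridge-utils build that includes the hairpin support):

  # via sysfs (available since 2.6.32)
  echo 1 > /sys/class/net/br0/brif/eth0/hairpin_mode

  # or via brctl
  brctl hairpin br0 eth0 on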

In the future the switch vendors (in the eVB group) expect to support
both VEPA and VEB on the same switch port. That is, the Linux host can
have some VMs using VEPA mode and some in VEB mode on the same outgoing
uplink. This protocol is yet to be fully defined and will require further
changes in the bridging function. The Ethernet frame will carry tags to
identify the packet streams (for VEPA or VEB ports); see chart 4 in the
IEEE document linked above.

However, from a libvirt definition point of view this implies that a
'bridge' can be in multiple modes (VEPA, VEB, or PEPA). An alternative is
to define separate bridges handling the VEB, VEPA, or PEPA modes for the
same 'sr-iov' or 'macvlan' backend.

Determining the switch capability:
---------------------------------
The Linux host can determine whether the adjacent bridge supports
'hairpin' mode, and can also set this capability, through a low-level
protocol (DCBx) being extended in the above eVB working group. Some
drivers (for NICs/CNAs) are likely to do this determination themselves
and make the information available to the hypervisor/Linux host.

Summary:
--------

Based on the above, a virtual machine might be defined to work with the
Linux/hypervisor bridge, with a 'macvlan' interface in bridge, VEPA, or
PEPA mode, or with an SR-IOV virtual function whose switching is in
bridge, VEPA, or PEPA mode.


Proposal:
--------

To support the above combinations we need to be able to define the bridge
to be used, the 'uplink' it is associated with, and the interface type
that the VM will use to connect to the bridge.

Currently in libvirt we define a bridge and can associate an Ethernet
device with it (which is the uplink to the network). In the 'macvlan'
and 'sr-iov' cases there is no creation of the bridge itself: in
'sr-iov' it is embedded in the NIC, and in the macvlan case the function
is enabled when the virtual interface is created.
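
(For reference, the host-level equivalent of today's bridge-plus-uplink
definition is roughly the following; names are only examples:)

  # create a software bridge and attach the uplink to it
  brctl addbr br0
  brctl addif br0 eth0
  ip link set br0 up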

Describing the bridge and modes:
--------------------------------
So, we can define the bridge function using a new type or maybe extend
the bridge.xml itself.

<interface type='bridge' name='br0'>
  <bridge>
    <type='hypervisor|embedded|ethernet'/>   // 'hypervisor' is the default
    <mode='all|VEPA|PEPA|VEB'/>              // 'all' is the default if supported
    <interface type='ethernet' name='eth0'/>
  </bridge>
</interface>

The 'type' and 'mode' need not be specified. libvirt could default to the 
virtual bridge in the hypervisor. Similarly, the supported modes may be
determined dynamically by libvirt.

Or, we could invent a new type for macvlan or sr-iov based switching:

<interface type='physical' name='pbr0'>
  <source dev='eth0'/>
  <type='embedded|ethernet'/>
  <mode='all|VEPA|PEPA|VEB'/>                // 'all' is the default if supported
</interface>

The above two descriptions imply that the bridge may be 'embedded'
(e.g. SR-IOV or VMDq NICs), the standard existing bridging (a VEB), or
macvlan based.

Describing the VM connectivity:
--------------------------------

With the above, in the domain XML, we can specify:

<interface type='physical|bridge'>
  <name='br0'/>
  <type='macvtap|tap|raw'/>
  <target mode='vepa|pepa|veb'/>             // only one mode can be specified
</interface>

Therefore, when instantiating a guest, libvirt will determine the type
of interface and bridge. For example, for 'vepa' mode with a bridge
defined as type 'ethernet', libvirt will create a macvtap interface and
set its mode to vepa.
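
Purely as an illustration of what that amounts to at the host level (not
a description of the eventual libvirt implementation), and assuming the
macvtap0 interface created earlier plus made-up image/device names, the
manual steps would look roughly like:

  # the guest NIC must use the macvtap device's own MAC address
  MAC=$(cat /sys/class/net/macvtap0/address)
  N=$(cat /sys/class/net/macvtap0/ifindex)

  # open /dev/tapN and pass the fd to qemu as the tap backend
  qemu-kvm -m 512 -drive file=/path/to/guest.img \
           -net nic,model=virtio,macaddr=$MAC \
           -net tap,fd=3 3<>/dev/tap$N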


thanks,
 	Vivek


__

Vivek Kashyap
Linux Technology Center, IBM



