[libvirt] Supporting vhost-net and macvtap in libvirt for QEMU

Mon Jan 25 17:38:15 UTC 2010

On 01/21/2010 03:13 PM, Vivek Kashyap wrote:
> .....
>>
>>>> So I think we want to maintain a concept of the qemu backend (virtio,
>>>> e1000, etc), the fd that connects the qemu backend to the host (tap,
>>>> socket, macvtap, etc), and the bridge.  The bridge bit gets a little
>>>> complicated.  We have the following bridge cases:
>>>>
>>>> - sw bridge
>>>>  - normal existing setup, w/ Linux bridging code
>>>>  - macvlan
>>>> - hw bridge
>>>>  - on SR-IOV card
>>>>    - configured to simply fwd to external hw bridge (like VEPA mode)
>>>>    - configured as a bridge w/ policies (QoS, ACL, port mirroring,
>>>>      etc. and allows inter-guest traffic and looks a bit like above
>>>>      sw switch)
>>>>  - external
>>>>    - need to possibly inform switch of incoming vport
>>>
>>> I've got mixed feelings here.  With respect to sw vs. hw bridge, I
>>> really think that that's an implementation detail that should not be
>>> exposed to a user.  A user doesn't typically want to think about
>>> whether they're using a hardware switch vs. software switch.  Instead,
>>> they approach it from, I want to have this network topology, and these
>>> features enabled.
>>
>> Agree there is a lot of low level detail there, and I think it will be
>> very hard for users, or apps to gain enough knowledge to make
>> intelligent decisions about which they should use. So I don't think we
>> want to expose all that detail. For a libvirt representation we need to
>> consider it more in terms of what capabilities each option provides,
>> rather than what implementation each option uses.
>>
>
> Attached is some background information on the VEPA bridging being
> discussed in this thread, followed by a proposal for defining it in
> libvirt XML.
>
> The 'Edge Virtual Bridging' (eVB) working group has proposed a
> mechanism to offload the bridging function from the server to a
> physical switch on the network. This is referred to as VEPA (Virtual
> Ethernet Port Aggregator). This is described here:
>
> http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-0709-v01.pdf 
>
>
> The VEPA mode implies that the virtual machines on a host communicate
> with each other via the physical switch on the network instead of the
> bridge in the Linux host.  The filtering, quality of service
> enforcement, stats etc. are all done in the external switch.
>
> The newer NICs with embedded switches (such as SR-IOV cards) will
> also provide VEPA mode. This implies that the communication between two
> virtual functions on the same physical NIC will also require a packet to
> travel to the first hop switch on the network and then be reflected back.
>
> The 'macvlan' driver in Linux supports virtual interfaces that can be
> attached to virtual machine interfaces. This patch provides a tap
> backend for macvlan:
> http://marc.info/?l=linux-kernel&m=125986323631311&w=2
> If such an interface is used, the packets are forwarded directly onto
> the network, bypassing the host bridge. This is exactly what is
> required for VEPA mode.
>
> However, the 'macvlan' driver can support both VEPA and 'bridging'
> mode. The bridging in this case is among its virtual interfaces only.
> There is also a private mode in which the packets are transmitted to
> the network but are not forwarded among the VMs.
>
> Similarly, the sr-iov's embedded switch in the future will be settable
> as 'VEPA', or 'private' or 'bridging' mode.
>
> In the eVB working group the 'private' mode is referred to as PEPA, and
> the 'bridging' as VEB (Virtual ethernet bridge). I'll use the same
> terms.
>
> The 'VEB' mode of macvlan or sr-iov is no different than the bridge
> in Linux. The behaviour of the networking/switching on the network is
> unaffected.
>
> Changes in the first-hop adjacent Switch on the network:
> ---------------------------------------------------------
> When the 'VEPA' (or PEPA) mode is used, the packet switching occurs on
> the first hop switch. Therefore, for VM to VM traffic, the first hop
> switch must support reflecting the packets back on the port on which
> they were received. This is referred to as the 'hairpin' or 'reflective
> relay' mode.
>
> The IEEE 802.1 body is standardizing the protocol, with the switch
> vendors and various server vendors working on the standard. This
> is derived from the above mentioned eVB ('edge virtual bridging')
> working group.
>
> To enable easy testing the Linux bridge can be put into the 'reflective
> relay' (or hairpin) mode. The patches are included in 2.6.32. The mode 
> can be set using sysfs or brctl commands (in latest bridge utils bits).
>
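For anyone wanting to try the hairpin setting, here is a minimal sketch of
the sysfs route in Python. It assumes the per-port attribute lives at
/sys/class/net/<bridge>/brif/<port>/hairpin_mode and that the caller is
root; the helper names and the br0/eth0 device names are made up for
illustration.

```python
from pathlib import Path


def hairpin_path(bridge, port, sysfs_root="/sys"):
    """Build the (assumed) sysfs path of the hairpin flag for one bridge port."""
    return Path(sysfs_root) / "class" / "net" / bridge / "brif" / port / "hairpin_mode"


def set_hairpin(bridge, port, enabled=True, sysfs_root="/sys"):
    """Write 1 or 0 to the hairpin flag and return the path that was written."""
    path = hairpin_path(bridge, port, sysfs_root)
    path.write_text("1\n" if enabled else "0\n")
    return path


if __name__ == "__main__":
    # Hypothetical names: bridge br0 with uplink port eth0. Needs root.
    print(set_hairpin("br0", "eth0", enabled=True))
```

With a recent enough bridge-utils, the brctl equivalent should be
"brctl hairpin br0 eth0 on".
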
> In the future the switch vendors (in eVB group) expect to support both
> VEPA and VEB on the same switch port. That is, the Linux host can have
> some VMs using VEPA mode and some in VEB mode on the same outgoing
> uplink. This protocol is to be fully defined and will require more
> changes in the bridging function. The ethernet frame will carry tags to
> identify the packet streams (for VEPA or VEB ports). See chart 4 in the
> above linked IEEE document.
>
> However, from a libvirt definition point of view it implies that a
> 'bridge' can be in multiple modes (VEPA or VEB or PEPA). An alternative
> is to define separate bridges handling VEB/VEPA or PEPA modes for the
> same 'sr-iov' or 'macvlan' backend.
>
> Determining the switch capability:
> ---------------------------------
> The Linux host can determine whether the remote bridge supports
> 'hairpin' mode, and also set this capability, through a low level
> protocol (DCBx) being extended in the above eVB working group.
> Some drivers (for NICs/CNAs) are likely to do this determination
> themselves and make the information available to the hypervisor/Linux
> host.
>
> Summary:
> --------
>
> Based on above a virtual machine might be defined to work with the
> Linux/hypervisor bridge, with the 'macvlan' in bridge, vepa/pepa modes,
> or with sr-iov virtual function with switching in bridge, or vepa/pepa
> modes.
>
>
> Proposal:
> --------
>
> To support the above combinations we need to be able to define the bridge
> to be used, the 'uplink' it is associated with, and the interface type
> that the VM will use to connect to the bridge.
>
> Currently in libvirt we define a bridge and can associate an ethernet
> with it (which is the uplink to the network). In the 'macvlan' and the
> 'sr-iov' cases there is no creation of the bridge itself. In 'sr-iov' it
> is embedded in the 'nic', and in the case of macvlan the function is
> enabled when the virtual interface is created.
>
> Describing the bridge and modes:
> --------------------------------
> So, we can define the bridge function using a new type or maybe extend
> the bridge.xml itself.
>
> <interface type='bridge' name='br0'>
> <bridge>
> <type='hypervisor|embedded|ethernet'/> //hypervisor is default
> <mode='all|VEPA|PEPA|VEB'/>          // 'all' is default if supported.
> <interface type='ethernet' name='eth0'/>
> </bridge>
> </interface>

Does this really map to how VEPA works?

For a physical bridge, you create a br0 network interface that also has 
eth0 as a component.

With VEPA and macv{lan,tap}, you do not create a single "br0" 
interface.  Instead, for the given physical port, you create interfaces 
for each tap device and hand them over.  IOW, while something like:

<interface type='bridge' name='br0'>
<bridge>
<interface type='ethernet' name='eth0'/>
<interface type='ethernet' name='eth1'/>
</bridge>
</interface>

Makes sense, the following wouldn't:

<interface type='bridge' name='br0'>
<bridge mode='VEPA'>
<interface type='ethernet' name='eth0'/>
<interface type='ethernet' name='eth1'/>
</bridge>
</interface>

I think the only use of the interface tag that would make sense is:

<interface type='ethernet' name='eth0'>
<vepa/>
</interface>

And then in the VM definition, instead of:

<interface type='direct'>
<source physical='eth0'/>
     ...
</interface>

You can imagine doing something similar with SR-IOV:

<interface type='ethernet' name='eth0'>
<sr-iov/>
</interface>

and in the guest:

>
> The 'type' and 'mode' need not be specified. libvirt could default to
> the virtual bridge in the hypervisor. Similarly, the supported modes
> may be determined dynamically by libvirt.
>
> Or, we could invent a new type for macvlan or sr-iov based switching:
>
> <interface type ='physical' name='pbr0'/>
> <source dev='eth0'/>
> <type='embedded|ethernet'/> <mode='all|VEPA|PEPA|VEB'/>     // all is default if supported.
> </interface>

IIUC, when you do macvlan/macvtap, there is no 'pbr0' interface.  It's 
fundamentally different than standard bridging and I think ought to be 
treated differently.

Regards,

Anthony Liguori