[libvirt] Re: Supporting vhost-net and macvtap in libvirt for QEMU

Thu Dec 17 21:39:05 UTC 2009

Chris Wright wrote:
> * Anthony Liguori (aliguori at linux.vnet.ibm.com) wrote:
>   
>> There are two modes worth supporting for vhost-net in libvirt.  The
>> first mode is where vhost-net backs to a tun/tap device.  This
>> behaves in very much the same way that -net tap behaves in qemu
>> today.  Basically, the difference is that the virtio backend is in
>> the kernel instead of in qemu so there should be some performance
>> improvement.
>>
>> Currently, libvirt invokes qemu with -net tap,fd=X where X is an
>> already open fd to a tun/tap device.  I suspect that after we merge
>> vhost-net, libvirt could support vhost-net in this mode by just
>> doing -net vhost,fd=X.  I think the only real question for libvirt
>> is whether to provide a user visible switch to use vhost or to just
>> always use vhost when it's available and it makes sense.
>> Personally, I think the latter makes sense.
>>     
>
> Doesn't sound useful.  Low-level, sure worth being able to turn things
> on and off for testing/debugging, but probably not something a user
> should be burdened with in libvirt.
>
> But I don't understand your -net vhost,fd=X, that would still be -net
> tap,fd=X, no?  IOW, vhost is an internal qemu impl. detail of the virtio
> backend (or if you get your wish, $nic_backend).
>   

I don't want to get bogged down in a qemu-devel discussion on 
libvirt-devel :-)

But from a libvirt perspective, I assume that it wants to open up 
/dev/vhost in order to not have to grant the qemu instance privileges 
which means that it needs to hand qemu the file descriptor to it.

Given a file descriptor, I don't think qemu can easily tell whether it's 
a tun/tap fd or whether it's a vhost fd.  Since they have different 
interfaces, we need libvirt to tell us which one it is.  Whether that's 
-net tap,vhost or -net vhost, we can figure that part out on qemu-devel :-)
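
To make the fd hand-off concrete, here is a minimal Python sketch of what 
a management layer might do: open the device itself, keep the descriptor 
open across exec, and say on the command line which kind of fd it is.  The 
function names, the `kind` parameter, and both option spellings are 
assumptions for illustration only; the real qemu syntax is not settled in 
this thread.

```python
import os
import subprocess


def net_option(fd, kind):
    """Build a -net argument that names the kind of fd being handed over.

    Both spellings below are hypothetical; the thread leaves the syntax open.
    """
    if kind == "tap":
        return f"tap,fd={fd}"
    if kind == "vhost":
        return f"vhost,fd={fd}"
    raise ValueError(f"unknown fd kind: {kind!r}")


def launch_with_fd(argv, fd):
    """Start a child process that inherits fd under the same number."""
    # pass_fds keeps the descriptor open in the child (POSIX only).
    return subprocess.Popen(argv, pass_fds=(fd,))


if __name__ == "__main__":
    # Stand-in for the privileged open of a tap or vhost device:
    # a pipe is used so the sketch runs without special privileges.
    read_end, write_end = os.pipe()
    print("-net", net_option(read_end, "vhost"))
    os.close(read_end)
    os.close(write_end)
```

The point of the explicit `kind` is that the receiving process never has 
to guess what the descriptor refers to.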

>> The more interesting invocation of vhost-net though is one where the
>> vhost-net device backs directly to a physical network card.  In this
>> mode, vhost should get considerably better performance than the
>> current implementation.  I don't know the syntax yet, but I think
>> it's reasonable to assume that it will look something like -net
>> tap,dev=eth0.   The effect will be that eth0 is dedicated to the
>> guest.
>>     
>
> tap?  we'd want either macvtap or raw socket here.
>   

I screwed up.  I meant to say, -net vhost,dev=eth0.  But maybe it 
doesn't matter if libvirt is the one that initializes the vhost device, 
sets up the raw socket (or macvtap), and hands us a file descriptor.

In general, I think it's best to avoid as much network configuration in 
qemu as humanly possible so I'd rather see libvirt configure the vhost 
device ahead of time and pass us an fd that we can start using.

>> On most modern systems, there is a small number of network devices
>> so this model is not all that useful except when dealing with SR-IOV
>> adapters.  In that case, each physical device can be exposed as many
>> virtual devices (VFs).  There are a few restrictions here though.
>> The biggest is that currently, you can only change the number of VFs
>> by reloading a kernel module so it's really a parameter that must be
>> set at startup time.
>>
>> I think there are a few ways libvirt could support vhost-net in this
>> second mode.  The simplest would be to introduce a new tag similar
>> to <source network='br0'>.  In fact, if you probed the device type
>> for the network parameter, you could probably do something like
>> <source network='eth0'> and have it Just Work.
>>     
>
> We'll need to keep track of more than just the other en
> We need to 0
>   

Is something missing here?

>> Another model would be to have libvirt see an SR-IOV adapter as a
>> network pool where it handles all of the VF management.
>> Considering how inflexible SR-IOV is today, I'm not sure whether
>> this is the best model.
>>     
>
> We already need to know the VF<->PF relationship.  For example, don't
> want to assign a VF to a guest, then a PF to another guest for basic
> sanity reasons.  As we get better ability to manage the embedded switch
> in an SR-IOV NIC we will need to manage them as well.  So we do need
> to have some concept of managing an SR-IOV adapter.
>   

But we still need to support the notion of backing a VNIC to a NIC, no?  
If this just happens to also work with a naive usage of SR-IOV, is that 
so bad? :-)

Long term, yes, I think you want to manage SR-IOV adapters as if they're 
a network pool.  But since they're sufficiently inflexible right now, 
I'm not sure it's all that useful today.
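
As a rough illustration of the VF<->PF bookkeeping Chris describes, a 
management layer could keep a per-adapter ownership table and refuse to 
give the PF to one guest while another guest holds a VF (and vice versa).  
The class, its method, and the device and guest names are hypothetical, and 
exclusive ownership of each device is a simplifying assumption of this 
sketch.

```python
class SriovAdapter:
    """Tracks which guest holds the PF or each VF of one adapter."""

    def __init__(self, pf, vfs):
        self.pf = pf
        self.vfs = frozenset(vfs)
        self.owner = {}  # device name -> guest name

    def assign(self, device, guest):
        """Give device to guest, rejecting PF/VF splits across guests."""
        if device != self.pf and device not in self.vfs:
            raise KeyError(device)
        if device in self.owner:
            raise ValueError(f"{device} is already assigned to {self.owner[device]}")
        for held, holder in self.owner.items():
            # True when one of the two devices is the PF and the other a VF.
            mixes_pf_and_vf = (device == self.pf) != (held == self.pf)
            if mixes_pf_and_vf and holder != guest:
                raise ValueError(
                    f"{device} conflicts with {held}, which is held by {holder}"
                )
        self.owner[device] = guest


if __name__ == "__main__":
    nic = SriovAdapter("eth0", ["vf0", "vf1"])
    nic.assign("vf0", "guest-a")
    try:
        nic.assign("eth0", "guest-b")
    except ValueError as err:
        print("rejected:", err)
```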

> So I think we want to maintain a concept of the qemu backend (virtio,
> e1000, etc), the fd that connects the qemu backend to the host (tap,
> socket, macvtap, etc), and the bridge.  The bridge bit gets a little
> complicated.  We have the following bridge cases:
>
> - sw bridge
>   - normal existing setup, w/ Linux bridging code
>   - macvlan
> - hw bridge
>   - on SR-IOV card
>     - configured to simply fwd to external hw bridge (like VEPA mode)
>     - configured as a bridge w/ policies (QoS, ACL, port mirroring,
>       etc. and allows inter-guest traffic and looks a bit like above
>       sw switch)
>   - external
>     - need to possibly inform switch of incoming vport
>   

I've got mixed feelings here.  With respect to sw vs. hw bridge, I 
really think that that's an implementation detail that should not be 
exposed to a user.  A user doesn't typically want to think about whether 
they're using a hardware switch vs. software switch.  Instead, they 
approach it from, I want to have this network topology, and these 
features enabled.

I think the notion of network pools as being somewhat opaque really 
works well for this.  Ideally you would create a network pool based on 
the requirements you had, and the management tool would figure out what 
the best set of implementations to use was.
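
A toy sketch of that idea, assuming the user states requirements as a set 
of feature names and the management tool picks the first implementation 
that covers them.  The function, the feature names, and the whole 
capability table are placeholders made up for illustration; only the 
policy-bridge features (QoS, ACL, port mirroring, inter-guest traffic) are 
taken from Chris's list.

```python
def pick_implementation(required, candidates):
    """Return the first candidate name whose features cover `required`."""
    for name, features in candidates:
        if required <= features:
            return name
    return None


# Placeholder capability table; real capabilities would have to be probed.
CANDIDATES = [
    ("sw-bridge", frozenset({"inter-guest"})),
    ("sriov-policy-bridge",
     frozenset({"inter-guest", "qos", "acl", "port-mirroring"})),
    ("vepa-external-switch", frozenset({"external-management"})),
]

if __name__ == "__main__":
    print(pick_implementation({"qos", "acl"}, CANDIDATES))
```

The user only ever names requirements; whether a software or hardware 
switch ends up behind the pool stays an implementation detail.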

VEPA is really a unique use-case in my mind.  It's when someone wants to 
use an external switch for their network management.

> And, we can have a hybrid.  E.g., no reason one VF can't be shared by a
> few guests.
>   
-- 
Regards,

Anthony Liguori