[libvirt] Re: Supporting vhost-net and macvtap in libvirt for QEMU

Thu Dec 17 22:32:32 UTC 2009

* Anthony Liguori (aliguori at linux.vnet.ibm.com) wrote:
> Chris Wright wrote:
> >* Anthony Liguori (aliguori at linux.vnet.ibm.com) wrote:
> >>There are two modes worth supporting for vhost-net in libvirt.  The
> >>first mode is where vhost-net backs to a tun/tap device.  This
> >>behaves in very much the same way that -net tap behaves in qemu
> >>today.  Basically, the difference is that the virtio backend is in
> >>the kernel instead of in qemu so there should be some performance
> >>improvement.
> >>
> >>Currently, libvirt invokes qemu with -net tap,fd=X where X is an
> >>already open fd to a tun/tap device.  I suspect that after we merge
> >>vhost-net, libvirt could support vhost-net in this mode by just
> >>doing -net vhost,fd=X.  I think the only real question for libvirt
> >>is whether to provide a user visible switch to use vhost or to just
> >>always use vhost when it's available and it makes sense.
> >>Personally, I think the latter makes sense.
> >
> >Doesn't sound useful.  Low-level, sure worth being able to turn things
> >on and off for testing/debugging, but probably not something a user
> >should be burdened with in libvirt.
> >
> >But I don't understand your -net vhost,fd=X, that would still be -net
> >tap,fd=X, no?  IOW, vhost is an internal qemu impl. detail of the virtio
> >backend (or if you get your wish, $nic_backend).
> 
> I don't want to get bogged down in a qemu-devel discussion on
> libvirt-devel :-)

The reason I brought it up here is in case libvirt would be doing both.
/dev/vhost takes an fd for a tap device or raw socket.  So libvirt would
need to open both, and then it becomes a question of whether libvirt only
passes the single vhost fd (after setting it up completely) or passes
both the vhost fd and connecting fd for qemu to put the two together.
I didn't recall how migration fits in (whether qemu would need the tap
fd again).
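
To make the fd hand-off concrete, here is a minimal sketch of a management
process passing an already-open descriptor to a child and naming it on the
command line.  It is not libvirt or qemu code: build_net_arg and
spawn_with_fd are hypothetical helpers, and only the -net tap,fd=X form
comes from this thread (the vhost spelling is still undecided).

```python
import os
import subprocess
import sys


def build_net_arg(kind, fd):
    """Format a '-net KIND,fd=N' style value (hypothetical helper).

    'tap' matches the existing -net tap,fd=X usage; any other kind,
    such as 'vhost', is only a proposal in this thread.
    """
    return "{},fd={}".format(kind, fd)


def spawn_with_fd(fd, argv):
    """Run argv with fd kept open in the child under the same number."""
    # pass_fds keeps only the listed descriptors open across exec, so the
    # child can use the fd without the privilege to open the device itself.
    return subprocess.run(argv, pass_fds=(fd,)).returncode


if __name__ == "__main__":
    # A pipe stands in for the tap or vhost device so this runs unprivileged.
    r, w = os.pipe()
    os.write(w, b"ping")
    os.close(w)
    child = "import os, sys; print(os.read(int(sys.argv[1]), 16))"
    spawn_with_fd(r, [sys.executable, "-c", child, str(r)])
    os.close(r)
    print(build_net_arg("tap", r))
```

If libvirt sets up vhost completely, one such fd is passed; otherwise the
same mechanism would carry two (the vhost fd and the connecting fd).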

> But from a libvirt perspective, I assume that it wants to open up
> /dev/vhost in order to not have to grant the qemu instance
> privileges which means that it needs to hand qemu the file
> descriptor to it.
> 
> Given a file descriptor, I don't think qemu can easily tell whether
> it's a tun/tap fd or whether it's a vhost fd.  Since they have
> different interfaces, we need libvirt to tell us which one it is.
> Whether that's -net tap,vhost or -net vhost, we can figure that part
> out on qemu-devel :-)

Yeah, I agree, just thinking of the workflow as it impacts libvirt.

> >>The more interesting invocation of vhost-net though is one where the
> >>vhost-net device backs directly to a physical network card.  In this
> >>mode, vhost should get considerably better performance than the
> >>current implementation.  I don't know the syntax yet, but I think
> >>it's reasonable to assume that it will look something like -net
> >>tap,dev=eth0.   The effect will be that eth0 is dedicated to the
> >>guest.
> >
> >tap?  we'd want either macvtap or raw socket here.
> 
> I screwed up.  I meant to say, -net vhost,dev=eth0.  But maybe it
> doesn't matter if libvirt is the one that initializes the vhost
> device, sets up the raw socket (or macvtap), and hands us a file
> descriptor.

Ah, gotcha, yeah.

> In general, I think it's best to avoid as much network configuration
> in qemu as humanly possible so I'd rather see libvirt configure the
> vhost device ahead of time and pass us an fd that we can start
> using.

Hard to disagree, but will that make qemu not work w/out libvirt?

> >>On most modern systems, there is a small number of network devices
> >>so this model is not all that useful except when dealing with SR-IOV
> >>adapters.  In that case, each physical device can be exposed as many
> >>virtual devices (VFs).  There are a few restrictions here though.
> >>The biggest is that currently, you can only change the number of VFs
> >>by reloading a kernel module so it's really a parameter that must be
> >>set at startup time.
> >>
> >>I think there are a few ways libvirt could support vhost-net in this
> >>second mode.  The simplest would be to introduce a new tag similar
> >>to <source network='br0'>.  In fact, if you probed the device type
> >>for the network parameter, you could probably do something like
> >><source network='eth0'> and have it Just Work.
> >
> >We'll need to keep track of more than just the other en
> >We need to 0
> 
> Is something missing here?

I got to it below.  Just noting that libvirt will need to track each
piece, the backend (virtio), the connector (tap,socket), and any bridge
setup.
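
Roughly, the three pieces could be tracked with a record like the one
below.  This is only a sketch of the idea, not libvirt's data model; the
names GuestNic, Connector and use_vhost are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Connector(Enum):
    """How the qemu backend reaches the host (hypothetical enumeration)."""
    TAP = "tap"
    RAW_SOCKET = "socket"
    MACVTAP = "macvtap"


@dataclass(frozen=True)
class GuestNic:
    model: str                # qemu backend, e.g. "virtio" or "e1000"
    connector: Connector      # the fd type that links backend and host
    bridge: Optional[str]     # e.g. "br0"; None when no bridge is set up
    use_vhost: bool = False   # in-kernel virtio backend requested

    def validate(self):
        # vhost is an implementation detail of the virtio backend, so
        # asking for it with any other model is rejected here.
        if self.use_vhost and self.model != "virtio":
            raise ValueError("vhost requires the virtio backend")


if __name__ == "__main__":
    nic = GuestNic("virtio", Connector.MACVTAP, bridge=None, use_vhost=True)
    nic.validate()
    print(nic)
```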

> >>Another model would be to have libvirt see an SR-IOV adapter as a
> >>network pool where it handled all of the VF management.
> >>Considering how inflexible SR-IOV is today, I'm not sure whether
> >>this is the best model.
> >
> >We already need to know the VF<->PF relationship.  For example, don't
> >want to assign a VF to a guest, then a PF to another guest for basic
> >sanity reasons.  As we get better ability to manage the embedded switch
> >in an SR-IOV NIC we will need to manage them as well.  So we do need
> >to have some concept of managing an SR-IOV adapter.
> 
> But we still need to support the notion of backing a VNIC to a NIC,
> no?  If this just happens to also work with a naive usage of SR-IOV,
> is that so bad? :-)

Nope, not at all ;-)

We do need to know if a VF is available or not (and if a PF has any of
its VFs used).  Needed on migration ("can I hook up to a VF on target?"),
and for assignment ("can I give this PCI device to a guest?  wait, it's
a PF and VF's are in use." Although, I don't think libvirt actually goes
beyond, "wait it's a PF").
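
The bookkeeping I have in mind looks something like this.  It is a toy
in-memory model, assuming one PF with a fixed set of VFs; SriovAdapter and
its methods are invented names, and nothing here talks to real hardware.

```python
class SriovAdapter:
    """Tracks which guest, if any, owns the PF and each VF (illustrative)."""

    def __init__(self, pf, vfs):
        self.pf = pf
        self.pf_owner = None
        self.vf_owner = {vf: None for vf in vfs}

    def free_vfs(self):
        """VFs not currently given to a guest, in declaration order."""
        return [vf for vf, owner in self.vf_owner.items() if owner is None]

    def assign_vf(self, guest):
        """Hand the first free VF to guest, e.g. on the migration target."""
        if self.pf_owner is not None:
            raise RuntimeError("PF is assigned to a guest; VFs unavailable")
        free = self.free_vfs()
        if not free:
            raise RuntimeError("no free VF on " + self.pf)
        self.vf_owner[free[0]] = guest
        return free[0]

    def assign_pf(self, guest):
        """Give the whole PF to guest, refusing if any VF is in use."""
        if self.pf_owner is not None:
            raise RuntimeError("PF already assigned")
        if len(self.free_vfs()) != len(self.vf_owner):
            raise RuntimeError("wait, it's a PF and VFs are in use")
        self.pf_owner = guest


if __name__ == "__main__":
    nic = SriovAdapter("eth0", ["eth0-vf0", "eth0-vf1"])
    print(nic.assign_vf("guest-a"))
    print(nic.free_vfs())
```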

> Long term, yes, I think you want to manage SR-IOV adapters as if
> they're a network pool.  But since they're sufficiently inflexible
> right now, I'm not sure it's all that useful today.
> 
> >So I think we want to maintain a concept of the qemu backend (virtio,
> >e1000, etc), the fd that connects the qemu backend to the host (tap,
> >socket, macvtap, etc), and the bridge.  The bridge bit gets a little
> >complicated.  We have the following bridge cases:
> >
> >- sw bridge
> >  - normal existing setup, w/ Linux bridging code
> >  - macvlan
> >- hw bridge
> >  - on SR-IOV card
> >    - configured to simply fwd to external hw bridge (like VEPA mode)
> >    - configured as a bridge w/ policies (QoS, ACL, port mirroring,
> >      etc. and allows inter-guest traffic and looks a bit like above
> >      sw switch)
> >  - external
> >    - need to possibly inform switch of incoming vport
> 
> I've got mixed feelings here.  With respect to sw vs. hw bridge, I
> really think that that's an implementation detail that should not be
> exposed to a user.  A user doesn't typically want to think about
> whether they're using a hardware switch vs. software switch.
> Instead, they approach it from, I want to have this network
> topology, and these features enabled.

libvirt needs to know what to do w/ the switch.  Ideally...all would
show up in Linux with the same mgmt interface, then libvirt would just
apply a port profile to a port on a switch, but we aren't there now.
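
As a sketch of that ideal (no such common interface exists today, and every
name below is hypothetical), the management side would only ever see one
operation, whatever kind of switch sits behind it:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PortProfile:
    """Settings to apply to one switch port (illustrative fields only)."""
    vlan: int
    acls: Tuple[str, ...] = ()


class Switch(ABC):
    """The single mgmt interface every switch type would implement."""

    @abstractmethod
    def apply_profile(self, port, profile):
        """Apply profile to the named port."""


class RecordingSwitch(Switch):
    """Stand-in for a sw bridge, SR-IOV embedded switch or external switch.

    It only records requests; a real implementation would configure the
    device.
    """

    def __init__(self, kind):
        self.kind = kind
        self.ports = {}

    def apply_profile(self, port, profile):
        self.ports[port] = profile


def attach_guest(switch, port, profile):
    """What the caller does, regardless of the switch implementation."""
    switch.apply_profile(port, profile)


if __name__ == "__main__":
    profile = PortProfile(vlan=100, acls=("deny-spoofed-mac",))
    for kind in ("sw-bridge", "sriov-embedded", "external"):
        sw = RecordingSwitch(kind)
        attach_guest(sw, "vport0", profile)
        print(kind, sw.ports)
```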

> I think the notion of network pools as being somewhat opaque really
> works well for this.  Ideally you would create a network pool based
> on the requirements you had, and the management tool would figure
> out what the best set of implementations to use was.
> 
> VEPA is really a unique use-case in my mind.  It's when someone
> wants to use an external switch for their network management.

It's an enterprise thing, sure, but we need to be able to manage it.
Ditto for a VN-Tag approach.  They all require some basic setup.

thanks,
-chris