[libvirt] RFC: PCI-Passthrough of SRIOV VF's new forward mode

Fri Feb 10 20:42:06 UTC 2012

On 02/08/2012 06:18 AM, Shradha Shah wrote:
> Hello Laine,
>
> Many Thanks for reviewing the RFC. Please find my reply inline.
>
> On 02/07/2012 02:36 AM, Laine Stump wrote:
>> On 02/06/2012 12:58 PM, Shradha Shah wrote:
>>> RFC: New network forward type pci-passthrough-hybrid
>>>
>>> I saw a couple of posts regarding PCI-Passthrough usage of SRIOV VF's a couple
>>> of weeks ago (20th Jan 2012). Initially I was going to post this RFC along with
>>> a set of patches. I would require a few more days to clean my patches for
>>> submission and hence I would start with an RFC on a new method to
>>> manage PCI-Passthrough of SRIOV VF's.
>>
>> I'm working on something similar, but purely in the domain's device list first.
>> My plan is that PCI passthrough interface devices will be defined as <interface type='hostdev'> (rather than in a <hostdev>), thus allowing config of all the network interface-related things that may be needed without polluting <hostdev> (and yet giving us an anchor where the guest-side PCI address can be fixed so that it remains the same across restarts of the guest). I discussed this in a later email last month:
>>
>> https://www.redhat.com/archives/libvir-list/2012-January/msg00840.html
>>
>> Note that the first message is a proposal I made to use <hostdev> that was discarded, and we later arrived at:
>>
>>   <devices>
>>     <interface type='hostdev'>
> I was thinking more like <interface type='network'> and in the network XML <forward mode='pci-passthrough'/'hostdev'> since I was thinking of adding a new mode to the existing enum virNetworkForwardType.

The two approaches are complementary. For example, just as in current
libvirt you can directly define <interface type='bridge'>, you can also
define <interface type='network'> and have the network be <forward
mode='bridge'>. In the end, the interface ends up being connected to a
non-libvirt-managed bridge device.

In this case, you could define <interface type='hostdev'> and directly
specify the host device in the interface definition, or you could define
<interface type='network'>, then have a pool of devices in the network
definition to choose from. In both cases, you would still end up with a
network device that was assigned to the guest via passthrough.

And just like interface type='bridge' and type='direct' were first
implemented directly, and then later support for that type of interface
was added to networks, I think it's logical to do it the same way here,
since the latter can make use of the code written for the former.

> So currently the virNetworkForwardType has vepa, private, bridge, passthrough.
> I was thinking of adding 
> 1) pci-passthrough or hostdev (VF passthrough to the guest, no virtio interface in the guest, as suggested in your previous proposals)
> 2) pci-passthrough-hybrid or hostdev-hybrid (VF passthrough to the guest + virtio interface in the guest to support migration with maximum performance results)

Yep. Agreed. BTW, I'm leaning towards <forward mode='hostdev'> because
what I'm working on will work equally well for USB or PCI devices.

I'm still unclear how the hybrid mode works (although I'm slowly getting
a better picture)

>>       <source dev='eth22'/>
> I was thinking in terms of having the source dev mentioned in the network XML, which would avoid problems we might face during migration.

Yes, in cases where the installation is large enough that people want to migrate.

Of course, there is the problem that the guest may have the PCI device
in some state at the time of migration that qemu doesn't know about, and
so the destination device can't be put into that exact state (for
example, maybe a packet has been written to the device's memory but not
yet transmitted). As a matter of fact, migration is currently forbidden
if there are any attached hostdevs - you must first detach them all,
then migrate, then attach new devices.
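
To make that ordering concrete, here is a toy model of the rule and the
workaround. It is only a sketch: every name in it is made up for
illustration, and none of it is libvirt code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a guest; all names here are made up for illustration. */
struct sketch_guest {
    size_t nhostdevs;      /* passthrough devices currently attached */
    bool on_destination;   /* false: still on the source host */
};

/* Mirrors the current rule: refuse to migrate while any hostdev is attached. */
static bool
sketch_migrate(struct sketch_guest *g)
{
    assert(g != NULL);
    if (g->nhostdevs > 0)
        return false;
    g->on_destination = true;
    return true;
}

/* The workaround: detach everything, migrate, then attach replacements. */
static bool
sketch_migrate_with_hostdevs(struct sketch_guest *g)
{
    size_t wanted;

    assert(g != NULL);
    wanted = g->nhostdevs;
    g->nhostdevs = 0;              /* detach all devices on the source */
    if (!sketch_migrate(g)) {
        g->nhostdevs = wanted;     /* migration failed: reattach on the source */
        return false;
    }
    g->nhostdevs = wanted;         /* attach new devices on the destination */
    return true;
}
```

Note that the model says nothing about device state lost during the
detach, which is exactly the problem described above.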

Have you actually experimented with migration of a guest that has one of
your devices assigned to the guest via PCI passthrough?

>
> Having a <source dev='eth22'> in the domain XML will mean that a similar device needs to be present on the destination host after migration else migration would fail.

Yes. That's the reason that the ability to define networks with pools of
network devices was added. It doesn't eliminate the usefulness of being
able to directly define it in the <interface> if you have a simple setup.

>>       <mac address='00:16:3e:5d:c7:9e'/>
>>       ...
>>     </interface>
>>   </devices>
>>
>> (see the first response from Paolo in the thread), in many ways returning to
>> the proposal of last August. The above XML will set the
>> MAC address of eth22, potentially associate a 802.1QbX port profile (if there is
>> a <virtualport> element), decode "eth22" into a PCI device, then attach that
>> device to the guest.
>>
>> It will also be acceptable to specify the source (host side) address as a pci address rather than a net device name (for those cases when the VF isn't bound to a driver and thus has no net device name).
> This sounds like a good idea when using a Solarflare network adapter, as Solarflare VF's do not have a net device name and operate with PCI addresses.
>
>> My plan has been to first implement this in <interface>, and then add the <forward mode='hostdev'> support in <network> (which would make things especially nice with your new patches to auto-generate the list of devices in the pool by specifying just a PF).
>>
>> Since you have some code already done, maybe we should compare notes - so far I've been working more on rearranging the data structures to accommodate the dual identity of a device that needs to be <interface> for config purposes, but <hostdev> (plus extra functionality) for device attachment purposes.
>>
> The work that we are trying to achieve definitely follows the same path and it would indeed be a great idea to share notes and parts of the code between ourselves before we submit patches upstream.
>>> Solarflare Ethernet card supports 127 VF's on each port. The MAC address of
>>> each unused VF is 00:00:00:00:00:00 by default. Hence the MAC address of the VF
>>> does not change on every reboot. There is no VF driver on the host. Each VF
>>> does not correspond to an Ethernet device. Instead, VF's are managed using the
>>> PCI sysfs files.
>> It's interesting that you say each VF doesn't correspond to an ethernet device. Is it that it doesn't, or just doesn't have to (but might)? My limited experience with SR-IOV hardware has been with an Intel 82576 card, which can operate in either fashion (if the igbvf driver is loaded and bound to the VFs, they have a network device name, otherwise they are visible only via the PF).
> Solarflare do not provide a separate VF driver (like the ixgbevf), we provide only a PF driver (sfc) hence the VF doesn't correspond to an ethernet device. 

Interesting. I guess it makes sense to not waste time with a VF driver,
since a host is unlikely to gain anything from using more than one network
device per port anyway, so the only practical use of the VFs (especially
127 of them!) is by virtual guests.

>>> With the pci-passthrough-hybrid model when the VF is passed into the guest,
>>> it appears in the guest as a PCI device and not as a network device. A virtual
>>> network device in the form of a virtio interface is also present
>>> in the guest. The virtio device in the guest comes from either bridging the
>>> physical network device or by creating a macvtap interface of type (vepa,
>>> private, bridge) on the physical network device. The virtio device
>>> and the VF bind together in the guest to create an accelerated and a
>>> non-accelerated path.
>> Now *this* is something I've not heard of before. Are you saying you attach the PCI device of the VF through to the guest, and also at the same time have a virtio device that cooperates with the passed-through PCI device? 
> Yes, that is correct. In the hybrid model the guest will have a virtio network device along with a passthrough VF as a PCI device. 
>
>> What is the relationship between this virtio driver and qemu's virtio-pci-net driver? Does it require patches to qemu and/or the host kernel? Or is it purely a driver on the guest side?
> Solarflare provide a guest driver that works along with the virtio driver. The guest driver is called XNAP. The XNAP driver is also called the 'plugin'.
>
> The Solarflare model for SR-IOV support is a hybrid of the two approaches. It uses a “plugin” approach which maintains the traditional (software) data path through virtio frontend to the KVM host (and then through the Linux bridge to the PF network driver). However, there is also an alternative (accelerated) data path through the VF directly to the network adapter from the guest. Packets can be received on either data path transparently to the guest VM’s network stack and on transmit the plugin (if loaded and enabled) takes the decision on whether to use the accelerated path.
>
> A VM can be created/cloned using traditional tools and networking to/from the VM initially uses the standard software network path. If a VF on the network adapter is then passed-through into the guest, the guest sees new hardware has been “hot-plugged” and binds the Solarflare plugin driver to this VF. This plugin driver automatically registers with the virtio driver as an accelerated network plugin. Once the VF driver has registered, subsequent traffic to/from the guest uses the accelerated data path accessing the adapter directly from the guest. If the VF is hot “unplugged” (i.e. removed from the guest), the plugin deregisters with the virtio front end and the networking traffic reverts to the software data path.
>
> This approach means there is no dependency on the VF or its driver for the networking data path of the VM. Acceleration can be disabled at any time if needed without losing network connectivity. Migration is fully supported in this model – between hosts with identical network adapters AND also between non-identical hosts.

Well, that answers my previous question about migration :-)

So I assume you perform an additional operation on top of libvirt
migration, i.e. you first detach the VF passthrough device, then
migrate, then at the other end you attach the new passthrough device, right?

>>> The new method I wish to propose, uses implicit pci-passthrough and there is no
>>> need to provide an explicit <hostdev> element in the domain xml. The hostdev
>>> would be added to the live xml as non-persistent as suggested by Laine Stump in
>>> a previous post, link to which can be found at:
>>> https://www.redhat.com/archives/libvir-list/2011-August/msg00937.html
>> Right. I was put off from that approach at the time because of the need to have a place to keep a stable guest-side PCI address. I went 180 degrees and tried to come up with something that would work as <hostdev>, but that didn't work and I'm back to where I started, but with what I believe is a plan that will work (see above). I'm interested to see how close that is to what you've got.
>>
>>> 1) In order to support the above mentioned hybrid model, the requirement is
>>> that the VF needs to be assigned the same MAC address as the virtio device in
>>> the guest. This enables the VF and the virtio device to bind successfully using
>>> the Solarflare driver called XNAP.
>>> Effectively we do not need to extend the <hostdev> schema. This can be taken care
>>> of by the <interface> element. Along with the MAC address the VLAN tags can also
>>> be taken care of by the <interface>/<network> elements.
>> Exactly! :-)
>>
>> It sounds like at least the XML you've mapped out is similar to mine.
>>
>>> 2) The VF appears in the guest as a PCI device hence the MAC address of the VF
>>> is stored in the sysfs files. Assigning the MAC address to the VF before or
>>> after pci passthrough is not an issue.
>>>
>>> Proposed steps to support the hybrid model of pci-passthrough in libvirt:
>>>
>>> 1) <network> will have a new forward type='pci-passthrough-hybrid'. When forward
>>> type='pci-passthrough-hybrid' instead of a pool of Ethernet interfaces a <pf>
>>> element will need to be specified for implicit VF allocation as shown in the
>>> example below:
>>>
>>> <network>
>>>    <name>direct-network</name>
>>>      <forward mode="pci-passthrough-hybrid">
>> I was thinking just <forward mode='hostdev'>. Is there something special that needs to be done by libvirt to support your hybrid model beyond setting the MAC address of the VF and associating with a virtualport?
> In this XML snippet I use <pf dev='eth2'>. Libvirt will first implicitly autogenerate a list of VF's from the PF. Apart from setting the MAC address of the VF and associating with a virtual port libvirt will have to create a macvtap/bridged interface on eth2 (virtio in the guest) and pci-passthrough a VF (attached to eth2) into the guest

Interesting. This will mean that a single entry in the <devices> will
use two PCI addresses. How are you currently dealing with that? Just
dynamically using whatever free address is available? In general we like
to maintain the stability of guest-side PCI addresses, and failure to do
so causes problems (e.g. MS Windows may require re-activation). The need
for 2 PCI addresses in a single device entry will require some thought.

Also, I'm realizing that this model will require the ability to detach
"half of a device" (since you need to be able to detach the VF
passthrough device without detaching the virtio device).

One other thing related to the XML - are you planning on always
specifying <model type='virtio'> in the <interface> config? If so, the
<network> definition doesn't need to say anything about whether its
devices are to be used for normal hostdev, or hybrid - when the domain
allocates a device from the network, it will then just look at <model>
and decide whether to do a plain passthrough, or a hybrid mode
passthrough. This way guests that don't have support for hybrid mode can
use the same network as those that do (they'll just leave out <model>)
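
A rough sketch of the decision I have in mind. The names are invented
for this mail and are not existing libvirt symbols, and how to treat a
<model> other than virtio is an open question that the sketch simply
answers with "plain passthrough".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
enum sketch_passthrough_kind {
    SKETCH_PASSTHROUGH_PLAIN,   /* VF only, no virtio device in the guest */
    SKETCH_PASSTHROUGH_HYBRID   /* VF plus a cooperating virtio device */
};

struct sketch_iface {
    const char *model;   /* value of <model type='...'/>, NULL if absent */
};

/* Pick the passthrough flavor from the domain's <interface> config alone,
 * so the <network> definition does not have to say anything about it. */
static enum sketch_passthrough_kind
sketch_choose_passthrough(const struct sketch_iface *iface)
{
    assert(iface != NULL);
    if (iface->model != NULL && strcmp(iface->model, "virtio") == 0)
        return SKETCH_PASSTHROUGH_HYBRID;
    return SKETCH_PASSTHROUGH_PLAIN;
}
```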

>>>      <pf dev="eth2"/>
>>>    </forward>
>>> </network>
>>>
>>> 2) In the domain's <interface> definition, when type='network' and if network
>>> has forward type='pci-passthrough-hybrid', the domain code will request an
>>> unused VF from the physical device. Example:
>>>
>>> <interface type='network'>
>>>     <source network='direct-network'/>
>>>     <mac address='00:50:56:0f:86:3b'/>
>>>     <model type='virtio'/>
>> Hmm. This really is a strange beast. Specifying virtio means that qemu is told about a standard virtio-net device (presumably at a different guest-side PCI address than the VF which has been assigned to the guest).
>>
>>>     <actual type='direct'>
>>>         <source mode='pci-passthrough-hybrid'/>
>>>     </actual>
>>
>> Of course the <actual> part will never show up in the static config, only in the runtime state after an allocation has been made (based on <source network='direct-network'/>).
> Yes the actual part will not show in the static config
>>
>>>   </interface>
>>>
>>> 3) The code will then use the NodeDevice API to learn all the necessary PCI
>>> domain/slot/bus/function information.
>> Actually it appears that there are enough functions in the internal pci API to convert between PF <-> VF PCI address <-> VF net device name, so I don't think the nodedevice API will even be needed.
> I agree with using the PCI API
>>> 4) Before starting the guest the VF's PCI device name (0000:04:00.2) will be
>>> saved in interface/actual so that it can be easily retrieved if libvirtd is
>>> restarted.
>> Correct, if assigned from a <network>. I figured it would be stored in <actual> as <source> <address type='pci' domain='..... />.
> I was thinking of adding a field called vf_pci_addr, and saving the PCI addr as a string, to the ActualNetDef structure like below:
> struct _virDomainActualNetDef {
>     int type; /* enum virDomainNetType */
>     union {
>         struct {
>             char *brname;
>         } bridge;
>         struct {
>             char *linkdev;
>             char *vf_pci_addr; /* stores the VF's PCI address as a string */
>             int mode; /* enum virMacvtapMode from util/macvtap.h */
>             virVirtualPortProfileParamsPtr virtPortProfile;
>         } direct;
>     } data;
>     virBandwidthPtr bandwidth;
> };
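
For what it's worth, going between a string like "0000:04:00.2" and a
structured domain/bus/slot/function address takes only a few lines. A
minimal sketch, with struct and function names that are hypothetical
(this is not the internal pci API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a structured PCI address. */
struct sketch_pci_addr {
    unsigned domain, bus, slot, function;
};

/* Parse "dddd:bb:ss.f" (all hex). Returns false on malformed input;
 * *out may be partially written in that case. */
static bool
sketch_pci_addr_parse(const char *str, struct sketch_pci_addr *out)
{
    char trailing;

    assert(str != NULL && out != NULL);
    /* The final %c only converts if something follows the function
     * digit, so a clean parse yields exactly four conversions. */
    if (sscanf(str, "%4x:%2x:%2x.%1x%c", &out->domain, &out->bus,
               &out->slot, &out->function, &trailing) != 4)
        return false;
    return out->slot <= 0x1f && out->function <= 7;
}

/* Format back into the sysfs-style name; returns snprintf()'s result. */
static int
sketch_pci_addr_format(const struct sketch_pci_addr *addr,
                       char *buf, size_t len)
{
    assert(addr != NULL && buf != NULL);
    return snprintf(buf, len, "%04x:%02x:%02x.%x",
                    addr->domain, addr->bus, addr->slot, addr->function);
}
```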

To allow for any type of hostdev (and allow re-using existing hostdev
management code), I've figured on something like this in the NetDef (and
similar in the ActualNetDef):

struct _virDomainNetDef {
    enum virDomainNetType type;
    unsigned char mac[VIR_MAC_BUFLEN];
     ...
    union {
         ...
        struct {
            char *linkdev;
            int mode; /* enum virMacvtapMode from util/macvtap.h */
            virNetDevVPortProfilePtr virtPortProfile;
        } direct;
**      struct {
**          virDomainHostdevDef def;
**          virNetDevVPortProfilePtr virtPortProfile;
        } hostdev;
    } data;
    struct {
        bool sndbuf_specified;
        unsigned long sndbuf;
    } tune;
     ...
    char *ifname;
    virDomainDeviceInfo info;
     ...
};

The HostdevDef would hold the PCI address information (or USB - not
important but it's already in there, so...) along with info about the
device's state prior to being assigned to the guest (you don't require
that for your application, but for general use by other cards it is a
requirement). The *really* nice thing about doing it this way, though,
is that a pointer to this hostdevdef can just be plopped right into the
domain's hostdevs[] array, and will then be included in things like
scans of all passthrough devices to determine which devices are in use, etc.
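
Very roughly, and with heavily simplified stand-in types (the real
structs have many more fields, and all the sketch_ names below are
invented for this mail), the idea is:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_pci_addr { unsigned domain, bus, slot, function; };
struct sketch_hostdev  { struct sketch_pci_addr source; };

struct sketch_netdef {
    struct sketch_hostdev hostdev;   /* embedded, like data.hostdev.def */
};

#define SKETCH_MAX_HOSTDEVS 8
struct sketch_domain {
    struct sketch_hostdev *hostdevs[SKETCH_MAX_HOSTDEVS];
    size_t nhostdevs;
};

/* Put a pointer to the interface's embedded hostdev into the
 * domain-wide list; no copy is made. */
static bool
sketch_domain_add_net_hostdev(struct sketch_domain *dom,
                              struct sketch_netdef *net)
{
    assert(dom != NULL && net != NULL);
    if (dom->nhostdevs == SKETCH_MAX_HOSTDEVS)
        return false;
    dom->hostdevs[dom->nhostdevs++] = &net->hostdev;
    return true;
}

/* A generic "is this host device in use?" scan that knows nothing
 * about network interfaces. */
static bool
sketch_domain_uses_pci(const struct sketch_domain *dom,
                       struct sketch_pci_addr addr)
{
    assert(dom != NULL);
    for (size_t i = 0; i < dom->nhostdevs; i++) {
        const struct sketch_pci_addr *s = &dom->hostdevs[i]->source;
        if (s->domain == addr.domain && s->bus == addr.bus &&
            s->slot == addr.slot && s->function == addr.function)
            return true;
    }
    return false;
}
```

Because the list holds a pointer rather than a copy, the interface and
the hostdev list can never disagree about which device is assigned.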

There's a bit more info about this in:

  https://www.redhat.com/archives/libvir-list/2012-January/msg01379.html

From talking about this hybrid mode, it sounds like, in addition to
adding the hostdev type to the data union, we also want to add one
called hybrid:

        struct {
            char *linkdev;
            int mode; /* enum virMacvtapMode from util/macvtap.h */
            virDomainHostdevDef def;
            virDomainDeviceInfo hostdevSourceInfo;
            virNetDevVPortProfilePtr virtPortProfile;
        } hybrid;

(This also coincidentally solves the problem of needing to reserve two
pci addresses on the guest - the address used for the virtio device will
be in netdef->info, and the address for the passthrough device will be
in netdef->data.hybrid.hostdevSourceInfo.)

>>> 5) While building the qemu command line, if a network device has forward
>>> mode='pci-passthrough-hybrid', the code will add a (non-persisting) <hostdev>
>>> element to the qemu command line. This <hostdev> will be marked as ephemeral
>>> before passing it to the guest. Ephemeral=transient.
>>>
>>> 6) During the process of network connection the MAC address of the VF will be
>>> set according to the domain <interface> config. This step can also involve
>>> setting the VLAN tag, port profiles, etc.
>>>
>>> 7) Following the above steps the guest will then start with implicit
>>> PCI-Passthrough of a SRIOV VF.
>>>
>>> 8) When the guest is eventually destroyed, the Ethernet device will be freed
>>> back to the network pool for use by another guest. Since the MAC address needs
>>> to be reset to 00:00:00:00:00:00 we do not need any reference to the higher
>>> level device definition.
>>>
>>> Since the VF is transient, it will be removed when the guest is shutdown and
>>> hotplugged again, by the libvirt API, when the guest is started. Hence, in
>>> order to get a list of hostdevs attached to a guest we only ever have to look
>>> at the <hostdev> element.
>>>
>>> One of the objections that had been raised following Mr Stump's post was that a
>>> transient hostdev will not ensure that the guest PCI address does not get
>>> changed each time the guest is run, but since the VF is a pci device in the
>>> guest and does not bind to specific driver, we can work with this proposed
>>> solution.
>>>
>>> Migration is possible using the above method without any explicit effort from
>>> the user in the following way:
>>> 1) Begin stage: All the ephemeral devices do not make their way into the xml
>>> that is passed to the destination.
>>> 2) Prepare stage: Replacement VF's on the destination, if present, will be
>>> automatically reserved and plugged in the guest by the networking code.
>>> 3) Perform stage: Any ephemeral devices are removed from the guest by libvirt.
>>> 4) Confirm stage: If migration fails the VF's will be restored else the VF's
>>> will be freed back to the networking pool by the networking code.
>> I've been tactfully avoiding migration questions :-)
> :-)
>
>>>     
>>> I have been working on the patches for the above mentioned method and would
>>> like to know your take on the hybrid model.
>> The part that is still confusing me is that you specify <model type='virtio'/> Is anything actually done with that? If not, then what you're talking about is very similar to what I'm trying to implement.
> We actually have a working hybrid model with exceptional performance available with the libvirt patches for RHEL6.1.

Believe me, we've heard about it :-)

> I am currently working on porting these libvirt patches for RHEL6.2 for some of our customers as well as working on cleaning these patches to submit upstream.
>
> We would be happy to send you some of our hardware if you wish to test our hybrid model on RHEL6.1.
>
>> Maybe we should get together offline - it's likely we can save each other a lot of time! (well, more likely that you can save me time than vice versa... :-)
> I definitely think we should work together offline. I would be happy to share notes and code that I am currently working on.
> Do let me know how I could be of help in our joint effort to get pci-passthrough support for SRIOV VF's.

I'll send you separate mail off-list.