[libvirt] RFC: PCI-Passthrough of SRIOV VF's new forward mode

Shradha Shah sshah at solarflare.com
Wed Feb 8 11:18:50 UTC 2012


Hello Laine,

Many Thanks for reviewing the RFC. Please find my reply inline.

On 02/07/2012 02:36 AM, Laine Stump wrote:
> On 02/06/2012 12:58 PM, Shradha Shah wrote:
>> RFC: New network forward type pci-passthrough-hybrid
>>
>> I saw a couple of posts regarding PCI-Passthrough usage of SRIOV VF's a couple
>> of weeks ago (20th Jan 2012). Initially I was going to post this RFC along with
>> a set of patches. I will require a few more days to clean my patches up for
>> submission, and hence I am starting with an RFC on a new method to
>> manage PCI-Passthrough of SRIOV VF's.
> 
> 
> I'm working on something similar, but purely in the domain's device list first.
> My plan is that PCI passthrough interface devices will be defined as <interface type='hostdev'> (rather than in a <hostdev>), thus allowing config of all the network interface-related things that may be needed without polluting <hostdev> (and yet giving us an anchor where the guest-side PCI address can be fixed so that it remains the same across restarts of the guest). I discussed this in a later email last month:
> 
> https://www.redhat.com/archives/libvir-list/2012-January/msg00840.html
> 
> Note that the first message is a proposal I made to use <hostdev> that was discarded, and we later arrived at:
> 
>   <devices>
>     <interface type='hostdev'>

I was thinking more along the lines of <interface type='network'>, with <forward mode='pci-passthrough'/'hostdev'> in the network XML, since I was planning to add a new mode to the existing enum virNetworkForwardType.
Currently virNetworkForwardType has vepa, private, bridge and passthrough.
I was thinking of adding:
1) pci-passthrough or hostdev (VF passthrough to the guest, no virtio interface in the guest, as suggested in your previous proposals)
2) pci-passthrough-hybrid or hostdev-hybrid (VF passthrough to the guest + virtio interface in the guest to support migration with maximum performance results)
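
To make the naming concrete, here is a rough sketch (mode and element names are still completely open, nothing below is final syntax) of what the two network definitions could look like:

<network>
  <name>vf-pool</name>
  <forward mode='pci-passthrough'>            <!-- or mode='hostdev' -->
    <pf dev='eth2'/>
  </forward>
</network>

<network>
  <name>vf-pool-hybrid</name>
  <forward mode='pci-passthrough-hybrid'>     <!-- or mode='hostdev-hybrid' -->
    <pf dev='eth2'/>
  </forward>
</network>

The difference would only show up in what the guest receives: the first mode hands the guest just the passed-through VF, while the second also gives it a virtio device with the same MAC address so that the XNAP plugin can pair the two.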

>       <source dev='eth22'/>

I was thinking in terms of having the source dev specified in the network XML, which avoids the problems we might otherwise face during migration.

Having a <source dev='eth22'/> in the domain XML would mean that a similar device needs to be present on the destination host after migration, otherwise migration would fail.

>       <mac address='00:16:3e:5d:c7:9e'/>
>       ...
>     </interface>
>   </devices>
> 
> (see the first response from Paolo in the thread), in many ways returning to
> the proposal of last August. The above XML will set the
> MAC address of eth22, potentially associate an 802.1QbX port profile (if there is
> a <virtualport> element), decode "eth22" into a PCI device, then attach that
> device to the guest.
> 
> It will also be acceptable to specify the source (host side) address as a pci address rather than a net device name (for those cases when the VF isn't bound to a driver and thus has no net device name).

This sounds like a good idea when using a Solarflare network adapter, as Solarflare VF's do not have a net device name and are addressed purely by PCI address.
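
For such adapters the source could presumably be given directly as a PCI address. Purely as an illustration (the element layout is only a sketch and the address values are made up), something along the lines of:

<interface type='hostdev'>
  <source>
    <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x2'/>
  </source>
  <mac address='00:16:3e:5d:c7:9e'/>
</interface>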

> 
> My plan has been to first implement this in <interface>, and then add the <forward mode='hostdev'> support in <network> (which would make things especially nice with your new patches to auto-generate the list of devices in the pool by specifying just a PF).
> 
> Since you have some code already done, maybe we should compare notes - so far I've been working more on rearranging the data structures to accommodate the dual identity of a device that needs to be <interface> for config purposes, but <hostdev> (plus extra functionality) for device attachment purposes.
> 

The work we are both trying to achieve definitely follows the same path, and it would indeed be a great idea to share notes and parts of our code before we submit patches upstream.

> 
>>
>> Solarflare Ethernet card supports 127 VF's on each port. The MAC address of
>> each unused VF is 00:00:00:00:00:00 by default. Hence the MAC address of the VF
>> does not change on every reboot. There is no VF driver on the host. Each VF
>> does not correspond to an Ethernet device. Instead, VF's are managed using the
>> PCI sysfs files.
> 
> It's interesting that you say each VF doesn't correspond to an ethernet device. Is it that it doesn't, or just doesn't have to (but might)? My limited experience with SRIOV hardware has been with an Intel 82576 card, which can operate in either fashion (if the igbvf driver is loaded and bound to the VFs, they have a network device name, otherwise they are visible only via the PF).

Solarflare does not provide a separate VF driver (like ixgbevf); we provide only a PF driver (sfc), hence the VF does not correspond to an ethernet device.
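
The VF is still visible on the host as an ordinary PCI node device, though. Roughly (this is only a hedged sketch of the node device XML, the details will differ per adapter), virsh nodedev-dumpxml for such a VF would show something like:

<device>
  <name>pci_0000_04_00_2</name>
  <capability type='pci'>
    <domain>0</domain>
    <bus>4</bus>
    <slot>0</slot>
    <function>2</function>
    <capability type='phys_function'>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </capability>
  </capability>
</device>

so libvirt can identify and manage it entirely through its PCI address and the sysfs files behind it.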

> 
> 
>>
>> With the pci-passthrough-hybrid model, when the VF is passed into the guest,
>> it appears in the guest as a PCI device and not as a network device. A virtual
>> network device in the form of a virtio interface is also present
>> in the guest. The virtio device in the guest comes either from bridging the
>> physical network device or from creating a macvtap interface of type (vepa,
>> private, bridge) on the physical network device. The virtio device
>> and the VF bind together in the guest to create an accelerated and a
>> non-accelerated path.
> 
> Now *this* is something I've not heard of before. Are you saying you attach the PCI device of the VF through to the guest, and also at the same time have a virtio device that cooperates with the passed-through PCI device? 

Yes, that is correct. In the hybrid model the guest will have a virtio network device along with a passthrough VF as a PCI device. 

> What is the relationship between this virtio driver and qemu's virtio-pci-net driver? Does it require patches to qemu and/or the host kernel? Or is it purely a driver on the guest side?

Solarflare provides a guest driver that works alongside the virtio driver. The guest driver is called XNAP; it is also referred to as the 'plugin'.

The Solarflare model for SR-IOV support is a hybrid of the two approaches. It uses a "plugin" approach which maintains the traditional (software) data path through the virtio frontend to the KVM host (and then through the Linux bridge to the PF network driver). However, there is also an alternative (accelerated) data path through the VF directly to the network adapter from the guest. Packets can be received on either data path transparently to the guest VM's network stack, and on transmit the plugin (if loaded and enabled) decides whether to use the accelerated path.

A VM can be created/cloned using traditional tools, and networking to/from the VM initially uses the standard software network path. If a VF on the network adapter is then passed through into the guest, the guest sees that new hardware has been "hot-plugged" and binds the Solarflare plugin driver to this VF. This plugin driver automatically registers with the virtio driver as an accelerated network plugin. Once the VF driver has registered, subsequent traffic to/from the guest uses the accelerated data path, accessing the adapter directly from the guest. If the VF is hot-unplugged (i.e. removed from the guest), the plugin deregisters with the virtio frontend and the networking traffic reverts to the software data path.

This approach means there is no dependency on the VF or its driver for the networking data path of the VM. Acceleration can be disabled at any time if needed without losing network connectivity. Migration is fully supported in this model, both between hosts with identical network adapters and between non-identical hosts.


>>
>> The new method I wish to propose uses implicit pci-passthrough, and there is no
>> need to provide an explicit <hostdev> element in the domain XML. The hostdev
>> would be added to the live XML as non-persistent, as suggested by Laine Stump in
>> a previous post, a link to which can be found at:
>> https://www.redhat.com/archives/libvir-list/2011-August/msg00937.html
> 
> Right. I was put off from that approach at the time because of the need to have a place to keep a stable guest-side PCI address. I went 180 degrees and tried to come up with something that would work as <hostdev>, but that didn't work and I'm back to where I started, but now with what I believe is a plan that will work (see above). I'm interested to see how close that is to what you've got.
> 
>>
>> 1) In order to support the above mentioned hybrid model, the requirement is
>> that the VF needs to be assigned the same MAC address as the virtio device in
>> the guest. This enables the VF and the virtio device to bind successfully using
>> the Solarflare driver called XNAP.
>> Effectively we do not need to extend the <hostdev> schema. This can be taken care
>> of by the <interface> element. Along with the MAC address the VLAN tags can also
>> be taken care of by the <interface>/<network> elements.
> 
> Exactly! :-)
> 
> It sounds like at least the XML you've mapped out is similar to mine.
> 
>>
>> 2) The VF appears in the guest as a PCI device, hence the MAC address of the VF
>> is stored in the sysfs files. Assigning the MAC address to the VF before or
>> after PCI passthrough is not an issue.
>>
>> Proposed steps to support the hybrid model of pci-passthrough in libvirt:
>>
>> 1) <network> will have a new forward type='pci-passthrough-hybrid'. When forward
>> type='pci-passthrough-hybrid', instead of a pool of Ethernet interfaces a <pf>
>> element will need to be specified for implicit VF allocation as shown in the
>> example below:
>>
>> <network>
>>    <name>direct-network</name>
>>      <forward mode="pci-passthrough-hybrid">
> 
> I was thinking just <forward mode='hostdev'>. Is there something special that needs to be done by libvirt to support your hybrid model beyond setting the MAC address of the VF and associating with a virtualport?
In this XML snippet I use <pf dev='eth2'/>. Libvirt will first implicitly autogenerate a list of VF's from the PF. Apart from setting the MAC address of the VF and associating it with a virtual port, libvirt will have to create a macvtap/bridged interface on eth2 (the virtio device in the guest) and pci-passthrough a VF (attached to eth2) into the guest.

> 
>>      <pf dev="eth2"/>
>>    </forward>
>> </network>
>>
>> 2) In the domain's <interface> definition, when type='network' and if network
>> has forward type='pci-passthrough-hybrid', the domain code will request an
>> unused VF from the physical device. Example:
>>
>> <interface type='network'>
>>     <source network='direct-network'/>
>>     <mac address='00:50:56:0f:86:3b'/>
>>     <model type='virtio'/>
> 
> Hmm. This really is a strange beast. Specifying virtio means that qemu is told about a standard virtio-net device (presumably at a different guest-side PCI address than the VF which has been assigned to the guest).
> 
>>     <actual type='direct'>
>>         <source mode='pci-passthrough-hybrid'/>
>>     </actual>
> 
> 
> Of course the <actual> part will never show up in the static config, only in the runtime state after an allocation has been made (based on <source network='direct-network'/>).
Yes, the <actual> part will not show up in the static config.
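
Purely as an illustration (none of this is final syntax, and the PCI address is just the example one used further down), the live XML after allocation might then carry both the filled-in <actual> and the auto-generated, transient <hostdev> for the VF:

<interface type='network'>
  <source network='direct-network'/>
  <mac address='00:50:56:0f:86:3b'/>
  <model type='virtio'/>
  <actual type='direct'>
    <source dev='eth2' mode='pci-passthrough-hybrid'/>
  </actual>
</interface>
<hostdev mode='subsystem' type='pci'>
  <!-- ephemeral/transient: present in the live XML only, never persisted -->
  <source>
    <address domain='0x0000' bus='0x04' slot='0x00' function='0x2'/>
  </source>
</hostdev>

with the VF's MAC address forced to match the virtio device's so the XNAP plugin can pair them.
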
> 
> 
>>   </interface>
>>
>> 3) The code will then use the NodeDevice API to learn all the necessary PCI
>> domain/slot/bus/function information.
> 
> Actually it appears that there are enough functions in the internal pci API to convert between PF <-> VF PCI address <-> VF net device name, so I don't think the nodedevice API will even be needed.

I agree with using the internal PCI API.
> 
>>
>> 4) Before starting the guest the VF's PCI device name (0000:04:00.2) will be
>> saved in interface/actual so that it can be easily retrieved if libvirtd is
>> restarted.
> 
> Correct, if assigned from a <network>. I figured it would be stored in <actual> as <source> <address type='pci' domain='..... />.
I was thinking of adding a field called vf_pci_addr to the virDomainActualNetDef structure, saving the PCI address as a string, like below:
struct _virDomainActualNetDef {
    int type; /* enum virDomainNetType */
    union {
        struct {
            char *brname;
        } bridge;
        struct {
            char *linkdev;
            char *vf_pci_addr; /* stores the VF's PCI address, e.g. "0000:04:00.2" */
            int mode; /* enum virMacvtapMode from util/macvtap.h */
            virVirtualPortProfileParamsPtr virtPortProfile;
        } direct;
    } data;
    virBandwidthPtr bandwidth;
};
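
When the status XML is written out, that string could then be formatted roughly along the lines Laine suggests, e.g. (again only a sketch, attribute names are not final):

<actual type='direct'>
  <source dev='eth2' mode='pci-passthrough-hybrid'>
    <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x2'/>
  </source>
</actual>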

> 
>>
>> 5) While building the qemu command line, if a network device has forward
>> mode='pci-passthrough-hybrid', the code will add a (non-persisting) <hostdev>
>> element to the qemu command line. This <hostdev> will be marked as ephemeral
>> before passing it to the guest. Ephemeral=transient.
>>
>> 6) During the process of network connection the MAC address of the VF will be
>> set according to the domain <interface> config. This step can also involve
>> setting the VLAN tag, port profiles, etc.
>>
>> 7) Following the above steps the guest will then start with implicit
>> PCI-Passthrough of an SRIOV VF.
>>
>> 8) When the guest is eventually destroyed, the Ethernet device will be freed
>> back to the network pool for use by another guest. Since the MAC address needs
>> to be reset to 00:00:00:00:00:00 we do not need any reference to the higher
>> level device definition.
>>
>> Since the VF is transient, it will be removed when the guest is shut down and
>> hotplugged again by the libvirt API when the guest is started. Hence, in
>> order to get a list of hostdevs attached to a guest we only ever have to look
>> at the <hostdev> element.
>>
>> One of the objections that had been raised following Mr Stump's post was that a
>> transient hostdev will not ensure that the guest PCI address stays the same
>> each time the guest is run, but since the VF is a PCI device in the
>> guest and does not bind to a specific driver, we can work with this proposed
>> solution.
>>
>> Migration is possible using the above method without any explicit effort from
>> the user in the following way:
>> 1) Begin stage: None of the ephemeral devices make their way into the XML
>> that is passed to the destination.
>> 2) Prepare stage: Replacement VF's on the destination, if present, will be
>> automatically reserved and plugged into the guest by the networking code.
>> 3) Perform stage: Any ephemeral devices are removed from the guest by libvirt.
>> 4) Confirm stage: If migration fails the VF's will be restored, else the VF's
>> will be freed back to the networking pool by the networking code.
> 
> I've been tactfully avoiding migration questions :-)

:-)

> 
>>     
>> I have been working on the patches for the above mentioned method and would
>> like to know your take on the hybrid model.
> 
> The part that is still confusing me is that you specify <model type='virtio'/>. Is anything actually done with that? If not, then what you're talking about is very similar to what I'm trying to implement.

We actually have a working hybrid model with exceptional performance available with the libvirt patches for RHEL6.1.

I am currently porting these libvirt patches to RHEL6.2 for some of our customers, as well as cleaning them up for upstream submission.

We would be happy to send you some of our hardware if you wish to test our hybrid model on RHEL6.1.

> 
> Maybe we should get together offline - it's likely we can save each other a lot of time! (well, more likely that you can save me time than vice versa... :-)

I definitely think we should work together offline. I would be happy to share notes and code that I am currently working on.
Do let me know how I could be of help in our joint effort to get pci-passthrough support for SRIOV VF's.

Many Thanks,
Regards,
Shradha Shah
