[libvirt] RFC: PCI-Passthrough of SRIOV VF's new forward mode

Mon Feb 6 17:58:09 UTC 2012

RFC: New network forward type pci-passthrough-hybrid 

I saw a couple of posts regarding PCI-Passthrough usage of SRIOV VFs a couple
of weeks ago (20th Jan 2012). I had intended to post this RFC together with a
set of patches, but I need a few more days to clean the patches up for
submission, so I am starting with an RFC on a new method to manage
PCI-Passthrough of SRIOV VFs.

I work for Solarflare Communications who make 10G network adapters. We 
currently have SRIOV capable adapters available and in production and we would 
like to work with upstream libvirt to develop the required support for our 
hardware.

This RFC introduces a new network forward mode to libvirt called
pci-passthrough-hybrid. It provides a solution for migration with
PCI-Passthrough as well as a significant increase in networking
performance.

The Solarflare SRIOV driver architecture for KVM is explained in the Release
notes which can be found here:
https://support.solarflare.com/index.php?view=categories&id=1813&option=com_cognidox&Itemid=2

This is a working model and currently available to Solarflare Customers for
evaluation. The hybrid model of the SRIOV driver provided by Solarflare
currently achieves the highest SPECvirt performance in the market.

Solarflare Ethernet cards support 127 VFs on each port. The MAC address of
each unused VF is 00:00:00:00:00:00 by default, so the MAC address of a VF
does not change on every reboot. There is no VF driver on the host, and a VF
does not correspond to an Ethernet device. Instead, VFs are managed using the
PCI sysfs files.
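
As a rough illustration of finding VFs through sysfs, the sketch below lists
the VFs behind a PF. It assumes the standard Linux SR-IOV layout, in which the
PF's PCI device directory (for example /sys/class/net/eth2/device) contains
virtfnN symlinks pointing at each VF's PCI device. The function name list_vfs
is made up for this example, and any Solarflare-specific sysfs files are not
covered.

```python
import os
import re

# Matches a PCI device name such as 0000:04:00.2 (domain:bus:slot.function).
PCI_NAME = re.compile(r"[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]")


def list_vfs(pf_device_dir):
    """Return {vf_index: pci_name} for the virtfnN links in a PF's sysfs dir.

    Hypothetical helper: assumes the generic virtfnN symlink layout.
    """
    vfs = {}
    for entry in os.listdir(pf_device_dir):
        match = re.fullmatch(r"virtfn(\d+)", entry)
        if match is None:
            continue  # not a VF link (e.g. "net", "driver")
        # The link target's last path component is the VF's PCI device name.
        target = os.path.basename(os.readlink(os.path.join(pf_device_dir, entry)))
        if PCI_NAME.fullmatch(target):
            vfs[int(match.group(1))] = target
    return vfs
```

On a host this would be called as list_vfs("/sys/class/net/eth2/device"),
where eth2 is the PF used in the examples in this RFC.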

With the pci-passthrough-hybrid model, when the VF is passed into the guest
it appears in the guest as a PCI device and not as a network device. A virtual
network device in the form of a virtio interface is also present in the guest.
The virtio device is backed either by a bridge on the physical network device
or by a macvtap interface (in vepa, private or bridge mode) on the physical
network device. The virtio device and the VF bind together in the guest to
create an accelerated and a non-accelerated path.

The new method I wish to propose uses implicit pci-passthrough, so there is no
need to provide an explicit <hostdev> element in the domain xml. The hostdev
would be added to the live xml as non-persistent, as suggested by Laine Stump
in a previous post, which can be found at:
https://www.redhat.com/archives/libvir-list/2011-August/msg00937.html

1) To support the hybrid model described above, the VF must be assigned the
same MAC address as the virtio device in the guest. This enables the VF and
the virtio device to bind successfully using the Solarflare driver called XNAP.
As a result we do not need to extend the <hostdev> schema: the MAC address can
be taken from the <interface> element, and VLAN tags can likewise be taken
care of by the <interface>/<network> elements.

2) The VF appears in the guest as a PCI device, hence the MAC address of the VF
is stored in the sysfs files. Assigning the MAC address to the VF before or
after PCI passthrough is not an issue.

Proposed steps to support the hybrid model of pci-passthrough in libvirt:

1) <network> will have a new forward mode='pci-passthrough-hybrid'. When
forward mode='pci-passthrough-hybrid', a <pf> element will need to be specified
instead of a pool of Ethernet interfaces, for implicit VF allocation, as shown
in the example below:

<network>
  <name>direct-network</name>
  <forward mode="pci-passthrough-hybrid">
    <pf dev="eth2"/>
  </forward>
</network>

2) In the domain's <interface> definition, when type='network' and the network
has forward mode='pci-passthrough-hybrid', the domain code will request an
unused VF from the physical device. Example:

<interface type='network'>
  <source network='direct-network'/>
  <mac address='00:50:56:0f:86:3b'/>
  <model type='virtio'/>
  <actual type='direct'>
    <source mode='pci-passthrough-hybrid'/>
  </actual>
</interface>

3) The code will then use the NodeDevice API to learn all the necessary PCI
domain/bus/slot/function information.

4) Before starting the guest the VF's PCI device name (0000:04:00.2) will be
saved in interface/actual so that it can be easily retrieved if libvirtd is
restarted.

5) While building the qemu command line, if a network device has forward
mode='pci-passthrough-hybrid', the code will add a (non-persistent) <hostdev>
element to the qemu command line. This <hostdev> will be marked as ephemeral
(that is, transient) before it is passed to the guest.

6) During the process of network connection the MAC address of the VF will be
set according to the domain <interface> config. This step can also involve
setting the VLAN tag, port profiles, etc.

7) Following the above steps, the guest will then start with implicit
PCI-Passthrough of an SRIOV VF.

8) When the guest is eventually destroyed, the Ethernet device will be freed
back to the network pool for use by another guest. Since the MAC address only
needs to be reset to 00:00:00:00:00:00, we do not need any reference to the
higher level device definition.
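
The allocation and release steps above can be sketched roughly as follows.
This is only an in-memory model, not code from the patches: parse_network and
VfPool are hypothetical names, the element and attribute names are taken from
the <network> example in this RFC, and a real implementation would write the
MAC address to the device rather than to a dictionary.

```python
import xml.etree.ElementTree as ET

UNUSED_MAC = "00:00:00:00:00:00"  # MAC of an unused VF, as described above


def parse_network(xml_text):
    """Return (forward_mode, pf_dev) from a <network> definition."""
    root = ET.fromstring(xml_text)
    forward = root.find("forward")
    if forward is None:
        return None, None
    pf = forward.find("pf")
    return forward.get("mode"), (pf.get("dev") if pf is not None else None)


class VfPool:
    """Hypothetical in-memory model of the VFs behind one PF."""

    def __init__(self, vf_pci_names):
        # Every VF starts out unused, i.e. with the all-zero MAC.
        self.macs = {name: UNUSED_MAC for name in vf_pci_names}

    def allocate(self, mac):
        """Pick an unused VF and give it the MAC of the guest's <interface>."""
        for name in sorted(self.macs):
            if self.macs[name] == UNUSED_MAC:
                self.macs[name] = mac.lower()
                return name
        raise RuntimeError("no unused VF available")

    def release(self, name):
        """Return a VF to the pool; only the MAC reset is needed."""
        self.macs[name] = UNUSED_MAC
```

The release path needs nothing but the VF's PCI name, which matches the point
in step 8 that no reference to the higher level device definition is required.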

Since the VF is transient, it will be removed when the guest is shut down and
hotplugged again, by the libvirt API, when the guest is started. Hence, in
order to get a list of hostdevs attached to a guest we only ever have to look
at the <hostdev> element.
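
To make the transient <hostdev> more concrete, here is a minimal sketch that
turns a saved VF PCI device name such as 0000:04:00.2 into a hostdev element.
The <source>/<address> layout follows libvirt's existing PCI hostdev format.
The ephemeral='yes' attribute is only an assumption about how the marking
might be spelled, and hostdev_for_vf is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET


def hostdev_for_vf(pci_name):
    """Build a <hostdev> element for a PCI device name like 0000:04:00.2."""
    match = re.fullmatch(
        r"([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])", pci_name
    )
    if match is None:
        raise ValueError("not a PCI device name: %r" % pci_name)
    domain, bus, slot, function = match.groups()

    # ephemeral='yes' is an assumed spelling for the transient marking.
    hostdev = ET.Element(
        "hostdev", {"mode": "subsystem", "type": "pci", "ephemeral": "yes"}
    )
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        {
            "domain": "0x" + domain,
            "bus": "0x" + bus,
            "slot": "0x" + slot,
            "function": "0x" + function,
        },
    )
    return hostdev
```

Calling ET.tostring(hostdev_for_vf("0000:04:00.2"), encoding="unicode") gives
the element as text for inclusion in the live xml.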

One of the objections raised following Mr Stump's post was that a transient
hostdev does not ensure that the guest PCI address stays the same each time
the guest is run. However, since the VF is a PCI device in the guest and does
not bind to a specific driver, we can work with this proposed solution.

Migration is possible using the above method without any explicit effort from
the user in the following way:
1) Begin stage: No ephemeral devices make their way into the xml that is
passed to the destination.
2) Prepare stage: Replacement VFs on the destination, if present, will be
automatically reserved and plugged into the guest by the networking code.
3) Perform stage: Any ephemeral devices are removed from the guest by libvirt.
4) Confirm stage: If migration fails the VFs will be restored, otherwise the
VFs will be freed back to the networking pool by the networking code.
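
The Begin stage can be sketched as a filter over the domain xml. This is an
illustration under assumptions rather than the patch itself: it assumes the
ephemeral marking appears as an ephemeral='yes' attribute on <hostdev>, and
strip_ephemeral_hostdevs is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def strip_ephemeral_hostdevs(domain_xml):
    """Return the domain xml with every ephemeral <hostdev> removed."""
    root = ET.fromstring(domain_xml)
    for devices in root.findall("devices"):
        # Copy the list first so removal does not disturb the iteration.
        for hostdev in list(devices.findall("hostdev")):
            if hostdev.get("ephemeral") == "yes":  # assumed attribute name
                devices.remove(hostdev)
    return ET.tostring(root, encoding="unicode")
```

Persistent hostdevs and the virtio <interface> are left untouched, so the
destination still sees the interface definition from which a replacement VF
can be allocated.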

I have been working on the patches for the above mentioned method and would
like to know your take on the hybrid model.