[libvirt] RFC: PCI-Passthrough of SRIOV VF's new forward mode

Tue Feb 7 02:36:22 UTC 2012

On 02/06/2012 12:58 PM, Shradha Shah wrote:
> RFC: New network forward type pci-passthrough-hybrid
>
> I saw a couple of posts regarding PCI-Passthrough usage of SRIOV VF's a couple
> of weeks ago (20th Jan 2012). Initially I was going to post this RFC along with
> a set of patches. I would require a few more days to clean my patches for
> submission and hence I would start with an RFC on a new method to
> manage PCI-Passthrough of SRIOV VF's.

I'm working on something similar, but purely in the domain's device list 
first.
My plan is that PCI passthrough interface devices will be defined as 
<interface type='hostdev'> (rather than in a <hostdev>), thus allowing 
config of all the network interface-related things that may be needed 
without polluting <hostdev> (and yet giving us an anchor where the 
guest-side PCI address can be fixed so that it remains the same across 
restarts of the guest). I discussed this in a later email last month:

https://www.redhat.com/archives/libvir-list/2012-January/msg00840.html

Note that the first message is a proposal I made to use <hostdev> that 
was discarded, and we later arrived at:

   <devices>
     <interface type='hostdev'>
       <source dev='eth22'/>
       <mac address='00:16:3e:5d:c7:9e'/>
       ...
     </interface>
   </devices>

(see the first response from Paolo in the thread), in many ways returning to
the proposal of last August. The above XML will set the
MAC address of eth22, potentially associate an 802.1QbX port profile (if there is
a <virtualport> element), decode "eth22" into a PCI device, then attach that
device to the guest.

It will also be acceptable to specify the source (host side) address as 
a pci address rather than a net device name (for those cases when the VF 
isn't bound to a driver and thus has no net device name).
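
To illustrate the 'decode "eth22" into a PCI device' step, here is a minimal Python sketch. It is not libvirt's code (which is C); it only assumes the usual Linux sysfs layout, where /sys/class/net/<ifname>/device is a symlink to the PCI device directory. The function name and the sysfs_root parameter are made up for the example.

```python
import os
import re

# Matches a full PCI address such as 0000:04:10.2 (domain:bus:slot.function).
_PCI_ADDR = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def netdev_to_pci_address(ifname, sysfs_root="/sys"):
    """Return the PCI address behind a host net device name.

    Hypothetical helper: sysfs_root is a parameter only so that the
    lookup can be exercised against a fake directory tree.
    """
    device_link = os.path.join(sysfs_root, "class", "net", ifname, "device")
    if not os.path.exists(device_link):
        raise LookupError(f"{ifname} has no backing device entry")
    # The last path component of the resolved symlink is the PCI address,
    # provided the interface really is backed by a PCI function.
    addr = os.path.basename(os.path.realpath(device_link))
    if not _PCI_ADDR.match(addr):
        raise LookupError(f"{ifname} is not backed by a PCI device: {addr}")
    return addr
```

A VF that has no driver bound has no entry under /sys/class/net at all, which is why the config would also need to accept a PCI address directly.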

My plan has been to first implement this in <interface>, and then add 
the <forward mode='hostdev'> support in <network> (which would make 
things especially nice with your new patches to auto-generate the list 
of devices in the pool by specifying just a PF).

Since you have some code already done, maybe we should compare notes - 
so far I've been working more on rearranging the data structures to 
accommodate the dual identity of a device that needs to be <interface> 
for config purposes, but <hostdev> (plus extra functionality) for device 
attachment purposes.

>
> Solarflare Ethernet card supports 127 VF's on each port. The MAC address of
> each unused VF is 00:00:00:00:00:00 by default. Hence the MAC address of the VF
> does not change on every reboot. There is no VF driver on the host. Each VF
> does not correspond to an Ethernet device. Instead, VF's are managed using the
> PCI sysfs files.

It's interesting that you say each VF doesn't correspond to an ethernet 
device. Is it that it doesn't, or just doesn't have to (but might)? My 
limited experience with sriov hardware has been with an Intel 82576 
card, which can operate in either fashion (if the igbvf driver is loaded 
and bound to the VFs, they have a network device name, otherwise they 
are visible only via the PF).
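
A quick way to tell the two cases apart on a host is to look for a "net" subdirectory under the VF's PCI device in sysfs. The sketch below is only an illustration in Python, assuming the standard /sys/bus/pci/devices/<address>/net/ layout; the function name and the sysfs_root parameter are invented for the example.

```python
import os


def vf_netdev_names(pci_addr, sysfs_root="/sys"):
    """List the net device names a PCI function exposes on the host.

    Returns an empty list when no network driver is bound to the
    function, i.e. when the VF is reachable only through its PF.
    """
    net_dir = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "net")
    if not os.path.isdir(net_dir):
        return []
    return sorted(os.listdir(net_dir))
```

With igbvf loaded, a call such as vf_netdev_names("0000:04:10.0") would be expected to return a single name; without a VF driver it would return an empty list.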

>
> With the pci-passthrough-hybrid model when the VF is passed into the guest,
> it appears in the guest as a PCI device and not as a network device. A virtual
> network device in the form of a virtio interface is also present
> in the guest. The virtio device in the guest comes from either bridging the
> physical network device or by creating a macvtap interface of type (vepa,
> private, bridge) on the physical network device. The virtio device
> and the VF bind together in the guest to create an accelerated and a
> non-accelerated path.

Now *this* is something I've not heard of before. Are you saying you 
attach the PCI device of the VF through to the guest, and also at the 
same time have a virtio device that cooperates with the passed-through 
PCI device? What is the relationship between this virtio driver and 
qemu's virtio-pci-net driver? Does it require patches to qemu and/or the 
host kernel? Or is it purely a driver on the guest side?

>
> The new method I wish to propose, uses implicit pci-passthrough and there is no
> need to provide an explicit <hostdev> element in the domain xml. The hostdev
> would be added to the live xml as non-persistent as suggested by Laine Stump in
> a previous post, link to which can be found at:
> https://www.redhat.com/archives/libvir-list/2011-August/msg00937.html

Right. I was put off from that approach at the time because of the need 
to have a place to keep a stable guest-side PCI address. I went 180 
degrees and tried to come up with something that would work as 
<hostdev>, but that didn't work and I'm back to where I started, but 
with what I believe is a plan that will work (see above). I'm interested 
to see how close that is to what you've got.

>
> 1) In order to support the above mentioned hybrid model, the requirement is
> that the VF needs to be assigned the same MAC address as the virtio device in
> the guest. This enables the VF and the virtio device to bind successfully using
> the Solarflare driver called XNAP.
> Effectively we do not need to extend the <hostdev> schema. This can be taken care
> of by the <interface> element. Along with the MAC address the VLAN tags can also
> be taken care of by the <interface>/<network> elements.

Exactly! :-)

It sounds like at least the XML you've mapped out is similar to mine.

>
> 2) The VF appears in the guest as a PCI device hence the MAC address of the VF
> is stored in the sysfs files. Assigning the MAC address to the VF before or
> after pci passthrough is not an issue.
>
> Proposed steps to support the hybrid model of pci-passthrough in libvirt:
>
> 1) <network> will have a new forward type='pci-passthrough-hybrid'. When forward
> type='pci-passthrough-hybrid', instead of a pool of Ethernet interfaces a <pf>
> element will need to be specified for implicit VF allocation as shown in the
> example below:
>
> <network>
>    <name>direct-network</name>
>    <forward mode="pci-passthrough-hybrid">

I was thinking just <forward mode='hostdev'>. Is there something special 
that needs to be done by libvirt to support your hybrid model beyond 
setting the MAC address of the VF and associating with a virtualport?

>      <pf dev="eth2"/>
>    </forward>
> </network>
>
> 2) In the domain's <interface> definition, when type='network' and if network
> has forward type='pci-passthrough-hybrid', the domain code will request an
> unused VF from the physical device. Example:
>
> <interface type='network'>
>     <source network='direct-network'/>
>     <mac address='00:50:56:0f:86:3b'/>
>     <model type='virtio'/>

Hmm. This really is a strange beast. Specifying virtio means that qemu is 
told about a standard virtio-net device (presumably at a different 
guest-side PCI address than the VF which has been assigned to the guest).

>     <actual type='direct'>
>         <source mode='pci-passthrough-hybrid'/>
>     </actual>

Of course the <actual> part will never show up in the static config, 
only in the runtime state after an allocation has been made (based on 
<source network='direct-network'/>).

> </interface>
>
> 3) The code will then use the NodeDevice API to learn all the necessary PCI
> domain/slot/bus/function information.

Actually it appears that there are enough functions in the internal pci 
API to convert between PF <-> VF PCI address <-> VF net device name, so 
I don't think the nodedevice API will even be needed.
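
For reference, the same PF <-> VF mapping can be read straight from sysfs. The Python sketch below is not the internal libvirt pci API mentioned above; it only assumes the standard Linux SR-IOV sysfs entries (virtfnN symlinks under the PF's device directory, and a physfn symlink under each VF). Function names and the sysfs_root parameter are invented for the example.

```python
import os
import re

_VIRTFN = re.compile(r"^virtfn(\d+)$")


def pf_virtual_functions(pf_ifname, sysfs_root="/sys"):
    """Map VF index -> VF PCI address for a PF given by net device name."""
    pf_dir = os.path.join(sysfs_root, "class", "net", pf_ifname, "device")
    vfs = {}
    for entry in os.listdir(pf_dir):
        match = _VIRTFN.match(entry)
        if match:
            # Each virtfnN entry is a symlink to the VF's PCI device directory.
            target = os.path.realpath(os.path.join(pf_dir, entry))
            vfs[int(match.group(1))] = os.path.basename(target)
    return vfs


def vf_physical_function(vf_pci_addr, sysfs_root="/sys"):
    """Return the PF's PCI address for a VF, or None if it has no physfn."""
    link = os.path.join(sysfs_root, "bus", "pci", "devices", vf_pci_addr, "physfn")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.path.realpath(link))
```

Given <pf dev="eth2"/>, pf_virtual_functions("eth2") would yield the candidate VF addresses for the pool without going through the nodedevice API.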

>
> 4) Before starting the guest the VF's PCI device name (0000:04:00.2) will be
> saved in interface/actual so that it can be easily retrieved if libvirtd is
> restarted.

Correct, if assigned from a <network>. I figured it would be stored in 
<actual> as <source> <address type='pci' domain='..... />.
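
As a rough illustration of that encoding, the Python sketch below turns a PCI device name such as 0000:04:00.2 into an <address type='pci'> element wrapped in <source>. The bus/slot/function attribute names and the 0x-prefixed hex formatting are an assumption, modelled on the <address type='pci'> element libvirt already uses for guest-side addresses; the function names are made up.

```python
import re
import xml.etree.ElementTree as ET

_PCI_RE = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def pci_address_element(pci_addr):
    """Build an <address type='pci' .../> element from 'dddd:bb:ss.f'."""
    match = _PCI_RE.match(pci_addr)
    if match is None:
        raise ValueError(f"not a PCI address: {pci_addr!r}")
    domain, bus, slot, function = (int(part, 16) for part in match.groups())
    if slot > 0x1F:
        raise ValueError(f"PCI slot out of range: {slot:#x}")
    return ET.Element("address", {
        "type": "pci",
        "domain": f"0x{domain:04x}",
        "bus": f"0x{bus:02x}",
        "slot": f"0x{slot:02x}",
        "function": f"0x{function:x}",
    })


def actual_source_xml(pci_addr):
    """Serialize the <source> wrapper that would sit inside <actual>."""
    source = ET.Element("source")
    source.append(pci_address_element(pci_addr))
    return ET.tostring(source, encoding="unicode")
```

Storing the address in split form rather than as a single string keeps it consistent with how other PCI addresses appear in the domain XML.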

>
> 5) While building the qemu command line, if a network device has forward
> mode='pci-passthrough-hybrid', the code will add a (non-persisting) <hostdev>
> element to the qemu command line. This <hostdev> will be marked as ephemeral
> before passing it to the guest. Ephemeral=transient.
>
> 6) During the process of network connection the MAC address of the VF will be
> set according to the domain <interface> config. This step can also involve
> setting the VLAN tag, port profiles, etc.
>
> 7) Following the above steps the guest will then start with implicit
> PCI-Passthrough of a SRIOV VF.
>
> 8) When the guest is eventually destroyed, the Ethernet device will be freed
> back to the network pool for use by another guest. Since the MAC address needs
> to be reset to 00:00:00:00:00:00 we do not need any reference to the higher
> level device definition.
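
The allocate/release bookkeeping in steps 2, 6 and 8 of the quoted list can be sketched in a few lines of Python. This is a purely in-memory illustration, not code from either patch set: the class and method names are hypothetical, and no sysfs or netlink calls are made to actually program the VF.

```python
class VFPool:
    """Toy model of a per-<network> pool of VFs keyed by PCI address."""

    UNUSED_MAC = "00:00:00:00:00:00"

    def __init__(self, vf_addresses):
        # Every VF starts out unused, carrying the all-zero MAC.
        self._mac_by_vf = {addr: self.UNUSED_MAC for addr in vf_addresses}
        self._owner_by_vf = {}

    def allocate(self, guest, mac):
        """Reserve the first unused VF for a guest and record its MAC."""
        for addr in self._mac_by_vf:
            if addr not in self._owner_by_vf:
                self._owner_by_vf[addr] = guest
                self._mac_by_vf[addr] = mac
                return addr
        raise RuntimeError("no unused VF left in the pool")

    def release(self, addr):
        """Return a VF to the pool and reset its MAC to the unused value."""
        if addr not in self._owner_by_vf:
            raise KeyError(f"{addr} is not allocated")
        del self._owner_by_vf[addr]
        self._mac_by_vf[addr] = self.UNUSED_MAC

    def mac_of(self, addr):
        return self._mac_by_vf[addr]
```

Because release always resets the MAC to the all-zero value, it needs nothing from the guest's <interface> definition, which matches the point made in step 8.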
>
> Since the VF is transient, it will be removed when the guest is shutdown and
> hotplugged again, by the libvirt API, when the guest is started. Hence, in
> order to get a list of hostdevs attached to a guest we only ever have to look
> at the <hostdev> element.
>
> One of the objections that had been raised following Mr Stump's post was that a
> transient hostdev will not ensure that the guest PCI address does not get
> changed each time the guest is run, but since the VF is a pci device in the
> guest and does not bind to specific driver, we can work with this proposed
> solution.
>
> Migration is possible using the above method without any explicit effort from
> the user in the following way:
> 1) Begin stage: All the ephemeral devices do not make their way into the xml
> that is passed to the destination.
> 2) Prepare stage: Replacement VF's on the destination, if present, will be
> automatically reserved and plugged in the guest by the networking code.
> 3) Perform stage: Any ephemeral devices are removed from the guest by libvirt.
> 4) Confirm stage: If migration fails the VF's will be restored else the VF's
> will be freed back to the networking pool by the networking code.

I've been tactfully avoiding migration questions :-)

>
> I have been working on the patches for the above mentioned method and would
> like to know your take on the hybrid model.

The part that is still confusing me is that you specify <model 
type='virtio'/>. Is anything actually done with that? If not, then what 
you're talking about is very similar to what I'm trying to implement.

Maybe we should get together offline - it's likely we can save each 
other a lot of time! (well, more likely that you can save me time than 
vice versa... :-)