[libvirt] libvirt support for mediated devices

Laine Stump laine at redhat.com
Mon Jan 9 03:36:39 UTC 2017


VFIO's new mediated device interface "is used for allowing
software-defined devices to be exposed through VFIO while the host
driver manages access to the interface" (quoted from
http://www.phoronix.com/scan.php?page=news_item&px=VFIO-Linux-4.10-Mediated
). Now that support for mediated devices has been added to the
upstream Linux kernel, there is a stable API that libvirt can use to
support assigning mediated devices (e.g. virtual GPUs) to QEMU/KVM
guests (or presumably any other hypervisor that supports device
assignment via VFIO).

We've had a few private discussions about what should be added to 
libvirt, and now have enough rough ideas to start discussing it on the list.

The major requirements we've come up with so far (in what I think is a 
reasonable order of implementation) are:

1) The ability to assign an already-created mediated device to a guest 
(think of "<hostdev ... managed='no'>" mode for assigning regular PCI 
devices).

2) Reporting of the capabilities of a mediated device "parent"
(including, for example, the supported types, the maximum number of
child devices supported, and the names of all existing child devices)
and of existing child devices (via the node device APIs, e.g. virsh
nodedev-list and virsh nodedev-dumpxml).

3) The ability to create and destroy mediated devices via the NodeDevice
API (similar in function to the "virsh detach-device" and "virsh
attach-device" commands - i.e. they make a device ready to be assigned to
a guest using <hostdev>, but have no persistent config and no
"auto-start" capability).

4) Support for "managed" mediated devices - libvirt will create a new
child device as required, and destroy it when it's no longer needed
(similar to the way that standard PCI hostdevs with managed="yes" are
detached from their host driver and attached to vfio-pci as needed). I
think this is less useful than item (5), but it is simpler and may be a
good way to test all the preceding additions (as well as being useful in
some simpler configurations).

5) The ability to create and manage "pools" of mediated devices, with 
persistent config and an auto-start capability so that the device pools 
are automatically created when the host is booted (this will require 
either some form of persistent config and lifecycle management to be 
added to the nodedevice driver, or a new libvirt driver type with 
functionality similar to storage pools, but used to manage pools of mdev 
child devices).

=========

Going back to the beginning, with slightly more detail:

1) "Unmanaged" mediated device assignment - assigning an existing device 
to a virtual machine

This will assume that the desired child device has already been created, 
and can be found in /sys/bus/mdev/devices/$UUID. Here's a first attempt 
at what the XML would look like:

     <hostdev mode='subsystem' type='pci' managed='no'>
         <source>  <!-- (maybe add "type='mdev'" ???) -->
             <mdev uuid='$uuid'/>
         </source>
         <address type='pci' blah blah blah/> <!-- PCI address in the guest -->
     </hostdev>

In the past, the "type" attribute of hostdev described the type on both 
the host and the guest. With mediated devices, the device isn't visible 
on the host as a PCI device, but just as a software device. So the type 
attribute in <hostdev> now gives the type of the device on the guest, 
and the device type on the host is determined from the contents of <source>.
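To make the "unmanaged" case concrete, here is a rough Python sketch (not libvirt code) of what a management tool might do: check that the child device already exists in /sys/bus/mdev/devices, then build the <hostdev> element in the form proposed above. The function names are made up for illustration, the XML is only the first-attempt layout from this mail rather than a final schema, and the guest-side <address> is left out because its contents are elided above as well.

```python
import os
import uuid
import xml.etree.ElementTree as ET

# Location of existing mdev child devices, as given in the text above.
MDEV_DEVICES = "/sys/bus/mdev/devices"


def mdev_exists(mdev_uuid, devices_dir=MDEV_DEVICES):
    """Return True if the child device has already been created."""
    return os.path.exists(os.path.join(devices_dir, mdev_uuid))


def unmanaged_hostdev_xml(mdev_uuid):
    """Build the proposed <hostdev> element for an existing mdev.

    Hypothetical helper: the element layout follows the first-attempt
    XML in this mail and may not match what libvirt finally accepts.
    """
    # Normalise the UUID; uuid.UUID raises ValueError if it is malformed.
    canonical = str(uuid.UUID(mdev_uuid))
    hostdev = ET.Element(
        "hostdev", {"mode": "subsystem", "type": "pci", "managed": "no"}
    )
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "mdev", {"uuid": canonical})
    return ET.tostring(hostdev, encoding="unicode")
```

A caller would presumably refuse to start the guest when mdev_exists() returns False, since in unmanaged mode nothing else is going to create the device.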

Erik had a different suggestion for this (which I think he's already 
working on patches for) - that the type attribute in <hostdev> should be 
the type of the device in the *host*, and the type in the guest would be 
that given in the <address>. Something like this I think:

     <hostdev mode='subsystem' type='mdev' managed='no'>
         <source>
             <mdev uuid='$uuid'/>
         </source>
         <address type='pci' blah blah blah/>
     </hostdev>

(Is this correct, Erik?)

(I arrived at my suggestion by the thinking that, in other places where 
there are similar attributes for the host and guest side, e.g. the IP 
addresses and routes that can be added on both the host and guest side 
of an <interface>, everything related to the host side is in the 
<source> subelement, while things related to the guest are directly 
under the toplevel of the device element. On the other hand, the 
"managed" attribute isn't something related to the guest, but to the 
host, and his idea has less redundancy, so maybe he's onto something...)

(NB: a mediated device could be exposed to the guest as a PCI device, a 
CCW device, or anything else supported by vfio. The type of device that 
the guest will see can be determined from the contents of 
mdev_supported_types/<type-id>/device_api under the parent device's 
directory in sysfs (it will be, e.g., "vfio-pci" or "vfio-ccw"). But 
libvirt assigns guest-side addresses at the time a domain is defined, 
and it's possible that the mdev child device won't be created yet at 
define time (and therefore we won't know which parent device it's 
associated with, and so we won't be able to look at device_api). In such 
situations, it will be up to management to know something about the 
device it will be creating and assume a type. Fortunately this is a 
reasonably safe thing to do - on x86 platforms we can be fairly certain 
that the device will be a PCI device. (And, because this also makes a
difference for some machinetypes, that it will be a PCI Express device.)
We will want to check device_api at runtime though, to validate that the
guest-side device really is a PCI device.)
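The runtime check mentioned above could look roughly like the following Python sketch. It reads mdev_supported_types/<type-id>/device_api under a parent device's sysfs directory and compares it with the address type that was assumed at define time. The function names are hypothetical, and mapping an address type to "vfio-" plus that type is an assumption based only on the two examples given above ("vfio-pci", "vfio-ccw").

```python
import os


def guest_device_api(parent_dir, type_id):
    """Read mdev_supported_types/<type-id>/device_api under a parent device."""
    path = os.path.join(parent_dir, "mdev_supported_types", type_id, "device_api")
    with open(path) as f:
        return f.read().strip()


def check_guest_address_type(parent_dir, type_id, address_type):
    """Validate the guest address type (e.g. 'pci') assumed at define time.

    Assumption: device_api is simply 'vfio-' + address_type, which holds
    for the 'vfio-pci' and 'vfio-ccw' examples but is not guaranteed.
    """
    api = guest_device_api(parent_dir, type_id)
    expected = "vfio-" + address_type
    if api != expected:
        raise ValueError(
            f"device_api is {api!r}, but the domain config assumed {expected!r}"
        )
    return api
```

Failing loudly here seems preferable to silently starting a guest with an address type the device cannot honour.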

==

2) Reporting parent and child mediated devices and their capabilities in 
the node device API.

There are 3 stages to this:

a) add mediated child devices to the list of devices provided by "virsh
nodedev-list". These will be called "mdev_$UUID", and will show up as
descendants of their respective parent devices in "virsh nodedev-list
--tree". The list of all these devices can easily be retrieved by
enumerating the links in /sys/bus/mdev/devices/.
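As a rough sketch of the enumeration in (a), the Python below walks the links in the mdev bus directory and maps each proposed node device name to its parent's sysfs path. It assumes that each link resolves to a directory located directly under the parent device's directory, so the parent is taken to be the containing directory of the link target. The function name is made up for illustration.

```python
import os


def list_mdev_children(devices_dir="/sys/bus/mdev/devices"):
    """Return {nodedev_name: parent_sysfs_path} for every mdev child.

    Assumption: each entry is a symlink whose target sits directly
    under the parent device's sysfs directory.
    """
    children = {}
    if not os.path.isdir(devices_dir):
        return children  # no mdev bus on this host
    for entry in sorted(os.listdir(devices_dir)):
        target = os.path.realpath(os.path.join(devices_dir, entry))
        # Name the device the way proposed above: "mdev_$UUID".
        children["mdev_" + entry] = os.path.dirname(target)
    return children
```

The parent path is what would let nodedev-list --tree hang each child under the right parent.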

b) report the capabilities of parent devices in their dumpxml output.
This will include supported child device types and a list of current
children.

I don't have any experience with nodedev reporting for SCSI devices, but 
recently noticed that nodedev-list can report lists of devices with 
certain capabilities, e.g. "virsh nodedev-list --cap=scsi_host". Based 
on this, I guess it would be useful for the parent devices to show 
something like this (using the sample mtty driver as an example):

      <device>
        <name>pci_0000_02_00_0</name>
        <parent>pci_0000_00_04_0</parent>
        <driver>
          <name>mtty</name>
        </driver>
        <capability type='mdev_parent'>
          [list of supported types, each with number allowed]
          [list of current child devices (just giving uuid or device name ("mdev_$uuid"?))]
          [other info about parent/children?]
        </capability>
        ...

Likewise, a nodedev-dumpxml of a child device should contain a pointer 
to the parent device.
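For what it's worth, the data for the bracketed placeholders above appears to be available from sysfs. The Python sketch below collects it for one parent device. It relies on the available_instances file and the devices/ subdirectory under each mdev_supported_types/<type-id>, which come from the kernel's mediated device documentation rather than from anything in this mail. The function name and the shape of the returned dict are invented for illustration, and no XML element names are proposed here.

```python
import os


def parent_capabilities(parent_dir):
    """Collect supported types and current children for an mdev parent.

    Returns {"types": {type_id: available_instances}, "children": [names]}.
    Assumption: sysfs layout as in the kernel mdev documentation
    (mdev_supported_types/<type-id>/available_instances and devices/).
    """
    types_dir = os.path.join(parent_dir, "mdev_supported_types")
    caps = {"types": {}, "children": []}
    for type_id in sorted(os.listdir(types_dir)):
        type_dir = os.path.join(types_dir, type_id)
        with open(os.path.join(type_dir, "available_instances")) as f:
            caps["types"][type_id] = int(f.read().strip())
        # devices/ holds one entry per existing child of this type.
        devices_dir = os.path.join(type_dir, "devices")
        if os.path.isdir(devices_dir):
            for child_uuid in sorted(os.listdir(devices_dir)):
                caps["children"].append("mdev_" + child_uuid)
    return caps
```

Note that available_instances is the number of devices that can still be created, which is not necessarily the same thing as the "number allowed" in the placeholder.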

c) respond to dumpxml requests for mediated child devices. This should
include at least the uuid/type of the child device, and a link back to
the parent device (and I suppose somehow include <capability
type='mdev_child'> so that it can be filtered with virsh nodedev-list?)

==

(3), (4), and (5) need more thought that I haven't gotten to yet. TBD 
(if anyone else has thoughts on those, please share!)

