[libvirt] [Qemu-devel] [RFC] libvirt vGPU QEMU integration

Thu Aug 25 15:18:57 UTC 2016

On 08/24/2016 06:29 PM, Daniel P. Berrange wrote:
> On Thu, Aug 18, 2016 at 09:41:59AM -0700, Neo Jia wrote:
>> Hi libvirt experts,
>>
>> I am starting this email thread to discuss the potential solution / proposal of
>> integrating vGPU support into libvirt for QEMU.
>>
>> Some quick background, NVIDIA is implementing a VFIO based mediated device
>> framework to allow people to virtualize their devices without SR-IOV, for
>> example NVIDIA vGPU, and Intel KVMGT. Within this framework, we are reusing the
>> VFIO API to process the memory / interrupt as what QEMU does today with passthru
>> device.
>>
>> The difference here is that we are introducing a set of new sysfs files for
>> virtual device discovery and life cycle management due to its virtual nature.
>>
>> Here is the summary of the sysfs, when they will be created and how they should
>> be used:
>>
>> 1. Discover mediated device
>>
>> As part of physical device initialization process, vendor driver will register
>> their physical devices, which will be used to create virtual device (mediated
>> device, aka mdev) to the mediated framework.
>>
>> Then, the sysfs file "mdev_supported_types" will be available under the physical
>> device sysfs, and it will indicate the supported mdev and configuration for this
>> particular physical device, and the content may change dynamically based on the
>> system's current configurations, so libvirt needs to query this file every time
>> before creating an mdev.
>>
>> Note: different vendors might have their own specific configuration sysfs as
>> well, if they don't have pre-defined types.
>>
>> For example, we have a NVIDIA Tesla M60 on 86:00.0 here registered, and here is
>> NVIDIA specific configuration on an idle system.
>>
>> For example, to query the "mdev_supported_types" on this Tesla M60:
>>
>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
>> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
>> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
>> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
>> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
>> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
>> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
>> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
>> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> I'm unclear on the requirements about data format for this file.
> Looking at the docs:
>
>    http://www.spinics.net/lists/kvm/msg136476.html
>
> the format is completely unspecified.
>
>> 2. Create/destroy mediated device
>>
>> Two sysfs files are available under the physical device sysfs path : mdev_create
>> and mdev_destroy
>>
>> The syntax of creating a mdev is:
>>
>>      echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
> I'm not really a fan of the idea of having to provide arbitrary vendor
> specific arguments to the mdev_create call, as I don't really want to
> have to create vendor specific code for each vendor's vGPU hardware in
> libvirt.
>
> What is the relationship between the mdev_supported_types data and
> the vendor_specific_argument_list requirements ?
>
>
>> The syntax of destroying a mdev is:
>>
>>      echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
>>
>> The $mdev_UUID is a unique identifier for this mdev device to be created, and it
>> is unique per system.
>>
>> For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
>> above Tesla M60 output), and a VM UUID to be passed as
>> "vendor_specific_argument_list".
>>
>> If no vendor specific arguments are required, either "$mdev_UUID" or
>> "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
> This raises the question of how an application discovers what
> vendor specific arguments are required or not, and what they
> might mean.
>
>> To create a M60-4Q device, libvirt needs to do:
>>
>>      echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" > /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
> Overall it doesn't seem like the proposed kernel interfaces provide
> enough vendor abstraction to be able to use this functionality without
> having to create vendor specific code in libvirt, which is something
> I want to avoid us doing.
>
>
>
> Ignoring the details though, in terms of libvirt integration, I think I'd
> see us primarily doing work in the node device APIs / XML. Specifically
> for physical devices, we'd have to report whether they support the
> mediated device feature and some way to enumerate the valid device
> types that can be created. The node device creation API would have to
> support creation/deletion of the devices (mapping to mdev_create/destroy)
>
>
> When configuring a guest VM, we'd use the <hostdev> XML to point to one
> or more mediated devices that have been created via the node device APIs
> previously.

I'd originally thought of this as having two separate points of support 
in libvirt as well:

In the node device driver:

  * reporting of mdev capabilities in the nodedev-dumpxml output of any
physdev (would this be adequate for discovery? It would, after all,
require doing a nodedev-list of all devices, then a nodedev-dumpxml of
every PCI device to search the XML for the presence of this capability)

  * new APIs to start a pool of mdevs and destroy a pool of mdevs
(would virNodeDeviceCreateXML()/virNodeDeviceDestroy() be adequate for
this? They create/destroy just a single device, so they would need to be
called multiple times, once for each mdev, which seems a bit ugly,
although accurate)

  * the addition of persistent config objects in the node device driver 
that can be started/destroyed/set to autostart [*]
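
To make the discovery question in the first item concrete, here is a
rough sketch of the "dump every device and search the XML" approach. It
assumes the capability would show up as a <capability type='mdev'>
element, which is purely a placeholder name (no XML schema has been
agreed on), and the sample documents are made up rather than real
nodedev-dumpxml output:

```python
import xml.etree.ElementTree as ET

# Hypothetical capability type; no nodedev XML schema for mdev exists yet.
MDEV_CAPABILITY = "mdev"


def supports_mdev(nodedev_xml):
    """Return True if one device's nodedev-dumpxml output advertises mdev."""
    root = ET.fromstring(nodedev_xml)
    # iter() walks nested elements too, so a capability reported inside
    # <capability type='pci'> is found as well.
    return any(cap.get("type") == MDEV_CAPABILITY
               for cap in root.iter("capability"))


def find_mdev_parents(xml_by_name):
    """Filter a {device name: dumpxml text} mapping down to mdev parents."""
    return sorted(name for name, xml in xml_by_name.items()
                  if supports_mdev(xml))


# Made-up sample documents standing in for nodedev-dumpxml output.
SAMPLE = {
    "pci_0000_86_00_0": """
<device>
  <name>pci_0000_86_00_0</name>
  <capability type='pci'>
    <capability type='mdev'/>
  </capability>
</device>""",
    "pci_0000_00_1f_2": """
<device>
  <name>pci_0000_00_1f_2</name>
  <capability type='pci'/>
</device>""",
}

print(find_mdev_parents(SAMPLE))
```

The cost is one XML document per PCI device on every scan, which is
what makes me wonder whether this is adequate for discovery.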

In the qemu driver:

  * some additional attributes in <hostdev> to point to a particular 
pool of mdevs managed by the node device driver

  * code to turn those new hostdev attributes into the proper mdev 
start/stop sequence, and qemu commandline option or QMP command

After learning that the GPU driver on the host was already doing all 
systemwide initialization, I began thinking that maybe (as Neo suggests) 
we could get by without the 2nd and 3rd items in the list for the node 
device driver - instead doing something more analogous to <hostdev 
managed='yes'>, where the mdev creation happens on demand (just like 
binding of a PCI device to the vfio-pci driver happens on demand).
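
For illustration, on-demand creation would boil down to composing the
string described in Neo's proposal and writing it to mdev_create at
domain startup. A minimal sketch of just the string handling follows;
the function name is hypothetical, the UUIDs in the usage lines are
made up, and the key=value argument format is only what the NVIDIA
example in the proposal shows (other vendors may differ):

```python
import uuid


def mdev_create_request(mdev_uuid, vendor_args=None):
    """Build the string the proposal writes to mdev_create/mdev_destroy.

    vendor_args is an optional mapping of vendor specific arguments; the
    key=value,key=value layout follows the NVIDIA example in the proposal
    and is an assumption for any other vendor.
    """
    # Normalise and validate the UUID; raises ValueError if malformed.
    canonical = str(uuid.UUID(mdev_uuid))
    if not vendor_args:
        # The proposal accepts a bare "$mdev_UUID" when no arguments apply.
        return canonical
    arg_list = ",".join(f"{key}={value}" for key, value in vendor_args.items())
    return f"{canonical}:{arg_list}"


# Made-up UUIDs; type id 11 is the "GRID M60-0B" row in the table above.
print(mdev_create_request(
    "c0b26072-dd1b-4340-84fe-bf338c510818",
    {"vgpu_type_id": 11, "vm_uuid": "0f3a2b9e-5c1d-4e7a-9b2f-6d8c4a1e7f30"},
))
```

Note that the caller still has to know which keys a given vendor
expects, which is exactly the abstraction problem Daniel raised.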

I still have an uneasy feeling about creating mdevs on demand at domain
startup though because, as I pointed out in my previous email in this
thread, while a GPU may be *potentially* capable of supporting several
different models of vGPU, once the first vGPU is created, all subsequent
vGPUs are restricted to being the same model as the first, which could
lead to surprises.

On the other hand, on-demand creation could be seen as more flexible,
especially if the GPU driver were ever to gain the ability to have
heterogeneous collections of vGPUs. I also wonder how much of a resource
burden it is to have a bunch of unused mdevs sitting around - is there
any performance (or memory usage) disadvantage to having e.g. 16 vGPUs
created vs. 2, if only 2 are currently in use?

========

[*] Currently the node device driver has virNodeDeviceCreateXML() and
virNodeDeviceDestroy(), but so far those are only used to tell udev to
create fibre channel "vports", and there is no persistent config stored
in libvirt for this (does udev store persistent config for it? Or must
it be redone at each host system reboot?). There is no place to define a
set of devices that should be automatically created at boot time /
libvirtd start (i.e. analogous to virNetworkDefineFlags() + setting
autostart for a network). This is what would be needed:
virNodeDeviceDefineFlags() (and accompanying persistent object storage),
virNodeDeviceSetAutostart(), and virNodeDeviceGetAutostart().

NB: this could also be useful for setting the max. # of VFs for an
SRIOV PF, although it's unclear exactly *how* - the old method of doing
that (kernel driver module commandline arguments) is non-standard and
deprecated, and there is no standard location for persistent config to
set it up using the new method (sysfs). In this case, the device
already exists (the SRIOV PF), and it just needs one of its sysfs
entries modified. It wouldn't really make sense to nodedev-define each
VF separately, because the kernel API just doesn't work that way - you
don't add each VF separately, you just set sriov_numvfs in the PF's
sysfs to 0, then set it to the number of VFs that are desired. So I'm
not sure how to shoehorn that into the idea of "creating a new node
device".
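
For reference, the sysfs sequence for VFs described above amounts to
something like the following sketch. The helper name is hypothetical,
and the check against sriov_totalvfs is an extra I added for safety,
not something libvirt does today:

```python
from pathlib import Path


def set_sriov_numvfs(pf_sysfs_dir, count):
    """Set the number of VFs on an SRIOV PF via its sysfs directory.

    pf_sysfs_dir is the PF's sysfs path, e.g. a directory under
    /sys/bus/pci/devices/. The value is reset to 0 first, then set to
    the desired count, matching the sequence described above.
    """
    if count < 0:
        raise ValueError("VF count must not be negative")
    pf = Path(pf_sysfs_dir)

    # Optional sanity check: sriov_totalvfs reports the PF's maximum.
    total = pf / "sriov_totalvfs"
    if total.exists() and count > int(total.read_text()):
        raise ValueError("requested more VFs than the PF supports")

    numvfs = pf / "sriov_numvfs"
    numvfs.write_text("0")             # tear down any existing VFs
    if count > 0:
        numvfs.write_text(str(count))  # then create the desired number
```

There is a single knob on an existing device here rather than one
object per VF, which is why it maps so poorly onto a create/define
model.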