[libvirt] [RFC] libvirt vGPU QEMU integration

Fri Aug 19 19:22:48 UTC 2016

On 08/18/2016 12:41 PM, Neo Jia wrote:
> Hi libvirt experts,
>
> I am starting this email thread to discuss the potential solution / proposal of
> integrating vGPU support into libvirt for QEMU.

Thanks for the detailed description. This is very helpful.

>
> Some quick background, NVIDIA is implementing a VFIO based mediated device
> framework to allow people to virtualize their devices without SR-IOV, for
> example NVIDIA vGPU, and Intel KVMGT. Within this framework, we are reusing the
> VFIO API to process the memory / interrupt as what QEMU does today with passthru
> device.
>
> The difference here is that we are introducing a set of new sysfs files for
> virtual device discovery and life cycle management due to its virtual nature.
>
> Here is the summary of the sysfs, when they will be created and how they should
> be used:
>
> 1. Discover mediated device
>
> As part of physical device initialization process, vendor driver will register
> their physical devices, which will be used to create virtual device (mediated
> device, aka mdev) to the mediated framework.

We've discussed this question offline, but I just want to make sure I 
understood correctly - all initialization of the physical device on the 
host is already handled "elsewhere", so libvirt doesn't need to be 
concerned with any physical device lifecycle or configuration (setting 
up the number or types of vGPUs), correct? Do you think this would also 
be the case for other vendors using the same APIs? I guess this all 
comes down to whether or not the setup of the physical device is defined 
within the bounds of the common infrastructure/API, or if it's something 
that's assumed to have just magically happened somewhere else.

>
> Then, the sysfs file "mdev_supported_types" will be available under the physical
> device sysfs, and it will indicate the supported mdev and configuration for this
> particular physical device, and the content may change dynamically based on the
> system's current configuration, so libvirt needs to query this file every time
> before creating an mdev.

I had originally thought that libvirt would be setting up and managing a 
pool of virtual devices, similar to what we currently do with SRIOV VFs. 
But from this it sounds like the management of this pool is completely 
handled by your drivers (especially since the contents of the pool can 
apparently completely change at any instant). In one way that makes life 
easier for libvirt, because it doesn't need to manage anything.

On the other hand, it makes things less predictable. For example, when 
libvirt defines a domain, it queries the host system to see what types 
of devices are legal in guests on this host, and expects those devices 
to be available at a later time. As I understand it (and I may be 
completely wrong), when no vGPUs are running on the hardware, there is a 
choice of several different models of vGPU (like the example you give 
below), but when the first vGPU is started up, that triggers the host 
driver to restrict the available models. If that's the case, then a 
particular vGPU could be "available" when a domain is defined, but not 
an option by the time the domain is started. That's not a show stopper, 
but I want to make sure I am understanding everything properly.

Also, is there any information about the maximum number of vGPUs that 
can be handled by a particular physical device (I think that changes 
based on which model of vGPU is being used, right?) Or maybe what is the 
current "load" on a physical device, in case there is more than one and 
libvirt (or management) wants to make a decision about which one to use?
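
To make the question concrete, here is a rough sketch of how a
management application might read the per-type limit out of
mdev_supported_types. It assumes the comma-separated layout of the
single NVIDIA sample quoted below; the VgpuType tuple and the
parse_supported_types() helper are made-up names for illustration, and
other vendors may well use a different layout.

```python
from typing import List, NamedTuple


class VgpuType(NamedTuple):
    type_id: int
    name: str
    max_instance: int
    framebuffer: str


# Sample text copied from the Tesla M60 listing in this thread.
SAMPLE = """\
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
"""


def parse_supported_types(text: str) -> List[VgpuType]:
    """Parse the assumed comma-separated listing, skipping the '#' header."""
    types = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = [f.strip().strip('"') for f in line.split(",")]
        types.append(
            VgpuType(int(fields[0]), fields[1], int(fields[2]), fields[5])
        )
    return types


if __name__ == "__main__":
    for t in parse_supported_types(SAMPLE):
        print(t.type_id, t.name, "max instances:", t.max_instance)
```

Since the file contents can change at any time, anything parsed this
way is only a snapshot; it would have to be re-read right before
creating a device.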

>
> Note: different vendors might have their own specific configuration sysfs as
> well, if they don't have pre-defined types.
>
> For example, we have a NVIDIA Tesla M60 on 86:00.0 here registered, and here is
> NVIDIA specific configuration on an idle system.
>
> For example, to query the "mdev_supported_types" on this Tesla M60:
>
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
>
> 2. Create/destroy mediated device
>
> Two sysfs files are available under the physical device sysfs path : mdev_create
> and mdev_destroy
>
> The syntax of creating a mdev is:
>
>      echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
>
> The syntax of destroying a mdev is:
>
>      echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
>
> The $mdev_UUID is a unique identifier for this mdev device to be created, and it
> is unique per system.

Is there any reason to try to maintain the same UUID from one run to the 
next? Or should we completely think of this as a cookie for this time 
only (so more like a file handle, but we get to pick the value)? (Michal 
has asked about this in relation to migration, but the question also 
applies in the general situation of simply stopping and restarting a guest).

Also, is it enforced that "UUID" actually be a 128 bit UUID, or can it 
be any unique string?
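
For illustration, this is roughly what I picture on the libvirt side if
the answer is "a real UUID, chosen by us each time". It is only a
sketch: the helper names are invented, and the strict canonical-form
check is an assumption on my part, not something the proposal states.

```python
import uuid


def make_mdev_uuid() -> str:
    """Pick a fresh random (version 4) UUID for a new mdev."""
    return str(uuid.uuid4())


def is_canonical_uuid(value: str) -> bool:
    """True only for the 8-4-4-4-12 hex form of a 128-bit UUID."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def build_mdev_arg(mdev_uuid: str, **vendor_args) -> str:
    """Build the string written to mdev_create / mdev_destroy.

    With no vendor arguments the bare UUID is used, which the proposal
    says is accepted.
    """
    if not vendor_args:
        return mdev_uuid
    arg_list = ",".join(f"{key}={val}" for key, val in vendor_args.items())
    return f"{mdev_uuid}:{arg_list}"


if __name__ == "__main__":
    mdev = make_mdev_uuid()
    # 17 is the id listed for "GRID M60-4Q" in the sample output above.
    print(build_mdev_arg(mdev, vgpu_type_id=17, vm_uuid=make_mdev_uuid()))
```

If the UUID has to stay stable across guest restarts, make_mdev_uuid()
would be replaced by a value stored in the domain config.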

>
> For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
> above Tesla M60 output), and a VM UUID to be passed as
> "vendor_specific_argument_list".
>
> If there is no vendor specific arguments required, either "$mdev_UUID" or
> "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
>
> To create a M60-4Q device, libvirt needs to do:
>
>      echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" > /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
>
> Then, you will see a virtual device show up at:
>
>      /sys/bus/mdev/devices/$mdev_UUID/
>
> For NVIDIA, to create multiple virtual devices per VM, they all have to be
> created up front before bringing any of them online.
>
> Regarding error reporting and detection, on failure, write() to sysfs using fd
> returns error code, and write to sysfs file through command prompt shows the
> string corresponding to error code.
>
> 3. Start/stop mediated device
>
> Under the virtual device sysfs, you will see a new "online" sysfs file.
>
> you can do cat /sys/bus/mdev/devices/$mdev_UUID/online to get the current status
> of this virtual device (0 or 1), and to start a virtual device or stop a virtual
> device you can do:
>
>      echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
>
> libvirt needs to query the current state before changing state.
>
> Note: if you have multiple devices, you need to write to the "online" file
> individually.
>
> For NVIDIA, if there are multiple mdev per VM, libvirt needs to bring all of
> them "online" before starting QEMU.
>
> 4. Launch QEMU/VM
>
> Pass the mdev sysfs path to QEMU as vfio-pci device:
>
>      -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0

1) I have the same question as Michal - you're passing the path to the 
sysfs directory for the device to qemu, which implies that the qemu 
process will need to open/close/read/write files in that directory. 
Since libvirt is running as root, it can easily do that, but libvirt 
then runs the qemu process under a different uid and with a different 
selinux context. We need to make sure that we can change the uid/selinux 
labelling of the items in sysfs without adverse effect elsewhere.

Also it's important that qemu doesn't need to access anything else 
outside of this device-specific directory (each qemu process is running 
with different selinux labeling and potentially a different uid:gid, so 
if there is any common file/device node that must be accessed directly 
by qemu, it would need to be safely globally readable/writable).
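
As a way to see what we would have to relabel, something like the
following could list every entry under the device directory with its
current owner and mode. This is just a diagnostic sketch; describe_tree()
is a made-up helper, it looks only at uid/gid/mode (not the selinux
label), and I don't know yet which of these entries qemu really touches.

```python
import os
import stat
import sys


def describe_tree(root):
    """Return (relative path, uid, gid, mode string) for each entry.

    Symlinks are reported but not followed, since sysfs device
    directories contain links that point outside the device.
    """
    entries = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            entries.append(
                (
                    os.path.relpath(path, root),
                    st.st_uid,
                    st.st_gid,
                    stat.filemode(st.st_mode),
                )
            )
    return entries


if __name__ == "__main__":
    # Usage: pass the device directory, e.g. the mdev's sysfs path.
    for rel, uid, gid, mode in describe_tree(sys.argv[1]):
        print(f"{mode} {uid}:{gid} {rel}")
```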

How does this device show up in the guest? I guess it's a PCI device 
(since you're using vfio-pci :-), and all the standard options for 
setting PCI address apply. And is this device legacy PCI, or PCI 
Express? (Or perhaps it changes behavior depending on the type of slot 
used in the guest?)

>
> 5. Shutdown sequence
>
> libvirt needs to shut down QEMU, bring the virtual device offline, then destroy
> the virtual device
>
> 6. VM Reset
>
> No change or requirement for libvirt as this will be handled via VFIO reset API
> and QEMU process will keep running as before.
>
> 7. Hot-plug
>
> It is optional for vendors to support hot-plug.
>
> And it is the same syntax to create a virtual device for hot-plug.
>
> For hot-unplug, after executing QEMU monitor "device_del" command, libvirt needs
> to write to the "mdev_destroy" sysfs file to complete the hot-unplug process.
>
> Since hot-plug is optional, then mdev_create or mdev_destroy operations may
> return an error if it is not supported.

From what I understand here, it sounds like what's needed from libvirt is:

1) exposing enough info in the output of nodedev-dumpxml for an 
application to use it to determine which devices are capable of creating 
vGPUs, and which models of vGPU they can create.

2) to create+start (then later stop+destroy) individual vGPUs based on 
[something] in the domain XML. So the question that remains is how to 
put it in the domain config. My first instinct was to use some variation 
of <hostdev> (since the backend of it is vfio-pci), but on the other 
hand hostdev is usually used to take one device that could be used by 
the host, take it away from the host, and give it to the guest, and 
that's not exactly what's happening here. So I wonder if there would be 
any advantage to making this another model of <video> instead.