[libvirt] [RFC] libvirt vGPU QEMU integration

Fri Aug 19 12:42:27 UTC 2016

On 18.08.2016 18:41, Neo Jia wrote:
> Hi libvirt experts,

Hi, welcome to the list.

> 
> I am starting this email thread to discuss the potential solution / proposal of
> integrating vGPU support into libvirt for QEMU.
> 
> Some quick background: NVIDIA is implementing a VFIO-based mediated device
> framework to allow people to virtualize their devices without SR-IOV, for
> example NVIDIA vGPU and Intel KVMGT. Within this framework, we are reusing the
> VFIO API to handle memory and interrupts the same way QEMU does today with a
> passthrough device.

So as far as I understand, this is solely NVIDIA's API and other vendors
(e.g. Intel) will use their own? Or is this a standard that others will
comply with?

> 
> The difference here is that we are introducing a set of new sysfs files for
> virtual device discovery and life cycle management due to its virtual nature.
> 
> Here is the summary of the sysfs, when they will be created and how they should
> be used:
> 
> 1. Discover mediated device
> 
> As part of physical device initialization process, vendor driver will register
> their physical devices, which will be used to create virtual device (mediated
> device, aka mdev) to the mediated framework.
> 
> Then, the sysfs file "mdev_supported_types" will be available under the physical
> device sysfs, and it will indicate the supported mdev types and configuration for
> this particular physical device. The content may change dynamically based on the
> system's current configuration, so libvirt needs to query this file every time
> before creating an mdev.

Ah, that was gonna be my question. Because in the example below, you
used "echo '...vgpu_type_id=20...' > /sys/bus/.../mdev_create". And I
was wondering where the number 20 comes from. Now what I am
wondering about is how libvirt should expose these to users and,
moreover, how it should let users choose.
We have a node device driver where I guess we could expose possible
options and then require some explicit value in the domain XML (but what
value would that be? I don't think taking vgpu_type_id-s as they are
would be a great idea).
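
To check that I read the format correctly, here is a rough Python sketch of
how the listing could be turned into records that the node device driver
might expose. It assumes the NVIDIA column layout quoted below with the
header on a single line; MdevType and parse_supported_types are names I made
up, and other vendors may use a different layout entirely.

```python
import csv
from dataclasses import dataclass


@dataclass
class MdevType:
    type_id: int
    name: str
    max_instance: int
    attrs: dict  # remaining vendor-specific columns, kept as strings


def parse_supported_types(text):
    """Parse mdev_supported_types content (assumed NVIDIA layout)."""
    header, types = [], []
    for fields in csv.reader(text.splitlines(), skipinitialspace=True):
        fields = [f.strip() for f in fields]
        if not fields or not fields[0]:
            continue  # skip blank lines
        if fields[0].startswith("#"):
            # Header line: "# vgpu_type_id, vgpu_type, ..."
            fields[0] = fields[0].lstrip("# ")
            header = [f for f in fields if f]
            continue
        row = dict(zip(header, fields))
        types.append(MdevType(int(row.pop("vgpu_type_id")),
                              row.pop("vgpu_type"),
                              int(row.pop("max_instance")),
                              row))
    return types


# Two rows copied from the quoted listing, for illustration only.
SAMPLE = '''\
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
'''

for t in parse_supported_types(SAMPLE):
    print(t.type_id, t.name, t.max_instance, t.attrs)
```

Since the content can change at any time, a parse like this would have to be
redone right before every create rather than cached.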

> 
> Note: different vendors might have their own specific configuration sysfs as
> well, if they don't have pre-defined types.
> 
> For example, we have an NVIDIA Tesla M60 registered on 86:00.0 here, and below is
> the NVIDIA-specific configuration on an idle system.
>
> To query the "mdev_supported_types" on this Tesla M60:
> 
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> 
> 2. Create/destroy mediated device
> 
> Two sysfs files are available under the physical device sysfs path : mdev_create
> and mdev_destroy
> 
> The syntax of creating a mdev is:
> 
>     echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
> 
> The syntax of destroying a mdev is:
> 
>     echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
> 
> The $mdev_UUID is a unique identifier for this mdev device to be created, and it
> is unique per system.

Ah, so a caller (the one doing the echo, e.g. libvirt) can generate
its own UUID under which the mdev will be known? I'm asking because of
migration: we might want to preserve UUIDs when a domain is migrated to
the other side. Speaking of which, is there such a limitation, or will
the guest be able to migrate even if the UUID changed?

> 
> For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
> above Tesla M60 output), and a VM UUID to be passed as
> "vendor_specific_argument_list".

I understand the need for vgpu_type_id, but can you shed more light on
the VM UUID? Why is that required?
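
Just so we agree on the string libvirt would have to write, this is a small
Python sketch of how I understand the syntax. The helper names
(new_mdev_uuid, format_mdev_request) are mine, and the key names
vgpu_type_id and vm_uuid are taken from your NVIDIA example; other vendors
would presumably pass different keys or none at all.

```python
import uuid


def new_mdev_uuid():
    """Generate a caller-chosen UUID for a new mdev."""
    return str(uuid.uuid4())


def format_mdev_request(mdev_uuid, **vendor_args):
    """Build "$mdev_UUID:key=value,..." for mdev_create / mdev_destroy.

    With no vendor arguments the bare UUID is returned, which the
    proposal says is accepted as well.
    """
    args = ",".join(f"{key}={value}" for key, value in vendor_args.items())
    return f"{mdev_uuid}:{args}" if args else str(mdev_uuid)


mdev_uuid = new_mdev_uuid()
vm_uuid = str(uuid.uuid4())  # stand-in for the domain UUID
# 17 is only an example id read from mdev_supported_types.
print(format_mdev_request(mdev_uuid, vgpu_type_id=17, vm_uuid=vm_uuid))
```

If the UUID has to survive migration, libvirt would store it in the domain
XML instead of generating a fresh one each time.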

> 
> If no vendor-specific arguments are required, either "$mdev_UUID" or
> "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
> 
> To create a M60-4Q device, libvirt needs to do:
> 
>     echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" > /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
> 
> Then, you will see a virtual device show up at:
> 
>     /sys/bus/mdev/devices/$mdev_UUID/
> 
> For NVIDIA, to create multiple virtual devices per VM, they all have to be
> created upfront before bringing any of them online.
>
> Regarding error reporting and detection: on failure, write() to the sysfs file
> returns an error code, and writing to the sysfs file from a shell shows the
> string corresponding to that error code.
> 
> 3. Start/stop mediated device
> 
> Under the virtual device sysfs, you will see a new "online" sysfs file.
> 
> You can do cat /sys/bus/mdev/devices/$mdev_UUID/online to get the current status
> of this virtual device (0 or 1), and to start or stop a virtual device you can
> do:
> 
>     echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
> 
> libvirt needs to query the current state before changing state.
> 
> Note: if you have multiple devices, you need to write to the "online" file
> individually.
> 
> For NVIDIA, if there are multiple mdev per VM, libvirt needs to bring all of
> them "online" before starting QEMU.

This is a valid requirement, indeed.
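
As a sanity check of the sequence, here is how I would sketch it in Python:
read "online" first, write only when the state differs, and bring every mdev
of the domain online before QEMU is started. The function names and the
rollback on failure are my own assumptions, not something the proposal
requires; the root parameter exists only so the code can be tried against a
scratch directory.

```python
import os

MDEV_ROOT = "/sys/bus/mdev/devices"


def set_online(mdev_uuid, online, root=MDEV_ROOT):
    """Query the current state, then write "1"/"0" only if it differs.

    Returns True if a write was needed, False otherwise.
    """
    path = os.path.join(root, mdev_uuid, "online")
    with open(path) as f:
        current = f.read().strip()
    wanted = "1" if online else "0"
    if current == wanted:
        return False
    with open(path, "w") as f:
        f.write(wanted)
    return True


def bring_all_online(mdev_uuids, root=MDEV_ROOT):
    """Bring each mdev online individually; undo our changes on error."""
    changed = []
    try:
        for mdev_uuid in mdev_uuids:
            if set_online(mdev_uuid, True, root):
                changed.append(mdev_uuid)
    except OSError:
        # Assumed policy: take back offline whatever we switched on.
        for mdev_uuid in reversed(changed):
            set_online(mdev_uuid, False, root)
        raise
    return changed
```

A failed write surfaces here as OSError, which matches the error reporting
described for the sysfs files.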

> 
> 4. Launch QEMU/VM
> 
> Pass the mdev sysfs path to QEMU as vfio-pci device:
> 
>     -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0

One question here. Libvirt allows users to run qemu under different
user:group than root:root. If that's the case, libvirt sets security
labels on all files qemu can/will touch. Are we going to need to do
something in that respect here?
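
Independent of the labelling question, generating the arguments looks
straightforward. A minimal Python sketch, assuming one "-device" per mdev and
ids numbered vgpu0, vgpu1, ... (the numbering scheme and the function name
are my guess; your example only shows vgpu0):

```python
def vfio_pci_device_args(mdev_uuids, root="/sys/bus/mdev/devices"):
    """Return QEMU arguments passing each mdev as a vfio-pci device."""
    args = []
    for index, mdev_uuid in enumerate(mdev_uuids):
        args += ["-device",
                 f"vfio-pci,sysfsdev={root}/{mdev_uuid},id=vgpu{index}"]
    return args
```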

> 
> 5. Shutdown sequence
>
> libvirt needs to shut down QEMU, bring the virtual device offline, then destroy
> the virtual device.
> 
> 6. VM Reset
> 
> No change or requirement for libvirt as this will be handled via VFIO reset API
> and QEMU process will keep running as before.
> 
> 7. Hot-plug
> 
> It is optional for vendors to support hot-plug.
>
> The same syntax is used to create a virtual device for hot-plug.
>
> For hot-unplug, after executing the QEMU monitor "device_del" command, libvirt
> needs to write to the "mdev_destroy" sysfs file to complete the hot-unplug process.
> 
> Since hot-plug is optional, then mdev_create or mdev_destroy operations may
> return an error if it is not supported.
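
The teardown order from sections 5 and 7, as I understand it, would look
roughly like this in Python. It assumes QEMU has already exited (or
device_del has completed), and that the same request string used for create
is written to mdev_destroy; teardown_mdev, parent_dir and vendor_args are
names I picked for illustration.

```python
import os


def teardown_mdev(mdev_uuid, parent_dir, vendor_args="",
                  mdev_root="/sys/bus/mdev/devices"):
    """Take an mdev offline if needed, then destroy it via its parent.

    parent_dir is the physical device's sysfs directory, i.e. the one
    holding mdev_destroy.
    """
    online_path = os.path.join(mdev_root, mdev_uuid, "online")
    with open(online_path) as f:
        is_online = f.read().strip() == "1"
    if is_online:
        with open(online_path, "w") as f:
            f.write("0")

    request = f"{mdev_uuid}:{vendor_args}" if vendor_args else mdev_uuid
    # Raises OSError if the vendor driver rejects the request, for
    # example when hot-unplug is not supported.
    with open(os.path.join(parent_dir, "mdev_destroy"), "w") as f:
        f.write(request)
```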

Thank you for the very detailed description! In general, I like the API as
it looks usable from my POV (I'm no VFIO devel though).

Michal