[libvirt] [RFC] libvirt vGPU QEMU integration

Mon Aug 22 05:40:30 UTC 2016

On Fri, Aug 19, 2016 at 02:42:27PM +0200, Michal Privoznik wrote:
> On 18.08.2016 18:41, Neo Jia wrote:
> > Hi libvirt experts,
> 
> Hi, welcome to the list.
> 
> > 
> > I am starting this email thread to discuss the potential solution / proposal of
> > integrating vGPU support into libvirt for QEMU.
> > 
> > Some quick background: NVIDIA is implementing a VFIO based mediated device
> > framework to allow people to virtualize their devices without SR-IOV, for
> > example NVIDIA vGPU and Intel KVMGT. Within this framework, we are reusing the
> > VFIO API to handle memory and interrupts the same way QEMU does today with a
> > passthru device.
> 
> So as far as I understand, this is solely NVIDIA's API and other vendors
> (e.g. Intel) will use their own or is this a standard that others will
> comply to?

Hi Michal,

Based on the initial vGPU VFIO design discussion thread on the QEMU mailing list,
I believe this is what NVIDIA, Intel, and even other companies will comply with.

(People from related parties are CC'ed in this email, such as Intel and IBM.)

As you know, I can't speak for Intel, so I would like to defer this question to
them, but the above is my understanding based on the QEMU/KVM community
discussions.

> 
> > 
> > The difference here is that we are introducing a set of new sysfs files for
> > virtual device discovery and life cycle management due to its virtual nature.
> > 
> > Here is the summary of the sysfs, when they will be created and how they should
> > be used:
> > 
> > 1. Discover mediated device
> > 
> > As part of the physical device initialization process, the vendor driver will
> > register its physical devices with the mediated framework; these will be used
> > to create virtual devices (mediated devices, aka mdev).
> > 
> > Then, the sysfs file "mdev_supported_types" will be available under the physical
> > device sysfs, and it will indicate the supported mdev types and configurations
> > for this particular physical device. The content may change dynamically based
> > on the system's current configuration, so libvirt needs to query this file
> > every time before creating an mdev.
> 
> Ah, that was gonna be my question. Because in the example below, you
> used "echo '...vgpu_type_id=20...' > /sys/bus/.../mdev_create". And I
> was wondering where does the number 20 come from. Now what I am
> wondering about is how libvirt should expose these to users and, moreover,
> how it should let users choose.
> We have a node device driver where I guess we could expose possible
> options and then require some explicit value in the domain XML (but what
> value would that be? I don't think taking vgpu_type_id-s as they are
> would be a great idea).

Right, the vgpu_type_id is just a handle for a given type of vGPU device in the
NVIDIA case.  How about exposing the "vgpu_type", which is a meaningful name
for vGPU end users?

Also, when you say users should be able to choose, do you mean exposing some
virsh command that lets users dump their potential virtual devices and pick
one?

> 
> > 
> > Note: different vendors might have their own specific configuration sysfs as
> > well, if they don't have pre-defined types.
> > 
> > For example, we have an NVIDIA Tesla M60 registered here on 86:00.0, and below
> > is the NVIDIA specific configuration on an idle system.
> > 
> > To query the "mdev_supported_types" on this Tesla M60:
> > 
> > cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> > # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
> > 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> > 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> > 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> > 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> > 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> > 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> > 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> > 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> > 
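
To illustrate how a management layer could map a user-facing vgpu_type name
back to its vgpu_type_id, here is a minimal shell sketch. It assumes the file
keeps the comma-separated layout shown above, with a '#' header line; the
helper name lookup_vgpu_type_id is made up for this example and is not part of
the framework.

```shell
# Hypothetical helper: print the vgpu_type_id for a given vgpu_type name.
# $1 = path to an mdev_supported_types file
# $2 = type name without quotes, e.g. "GRID M60-4Q"
# Returns non-zero if the type is not listed (it may have disappeared,
# since the content can change with the system's current configuration).
lookup_vgpu_type_id() {
    awk -F',' -v want="$2" '
        /^[ \t]*#/ { next }     # skip the header comment line
        NF < 2     { next }     # skip blank or malformed lines
        {
            id = $1; name = $2
            gsub(/^[ \t]+|[ \t]+$/, "", id)      # trim padding around the id
            gsub(/^[ \t"]+|[ \t"]+$/, "", name)  # trim padding and quotes
            if (name == want) { print id; found = 1; exit }
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}
```

With the listing above, looking up "GRID M60-4Q" would print 17. Because the
file is re-read on every call, the lookup reflects whatever types are
available at that moment.
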
> > 2. Create/destroy mediated device
> > 
> > Two sysfs files are available under the physical device sysfs path: mdev_create
> > and mdev_destroy.
> > 
> > The syntax for creating an mdev is:
> > 
> >     echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
> > 
> > The syntax for destroying an mdev is:
> > 
> >     echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
> > 
> > The $mdev_UUID is a unique identifier for this mdev device to be created, and it
> > is unique per system.
> 
> Ah, so a caller (the one doing the echo - e.g. libvirt) can generate
> their own UUID under which the mdev will be known? I'm asking because of
> migration - we might want to preserve UUIDs when a domain is migrated to
> the other side. Speaking of which, is there such limitation or will
> guest be able to migrate even if UUID's changed?

Yes. As long as the mdev UUID is unique per system, it should be fine even if
it changes across migration.

> 
> > 
> > For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
> > above Tesla M60 output), and a VM UUID to be passed as
> > "vendor_specific_argument_list".
> 
> I understand the need for vgpu_type_id, but can you shed more light on
> the VM UUID? Why is that required?

Sure. NVIDIA vGPU requires it, especially to support multiple vGPU devices per
VM: we have a SW entity that manages all vGPU devices of a VM, and it also
reserves special GPU resources for the multiple-vGPU-per-VM case.

> 
> > 
> > If no vendor specific arguments are required, either "$mdev_UUID" or
> > "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
> > 
> > To create an M60-4Q device, libvirt needs to do:
> > 
> >     echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" > /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
> > 
> > Then, you will see a virtual device show up at:
> > 
> >     /sys/bus/mdev/devices/$mdev_UUID/
> > 
> > For NVIDIA, when a VM needs multiple virtual devices, all of them have to be
> > created up front before bringing any of them online.
> > 
> > Regarding error reporting and detection: on failure, a write() to the sysfs
> > file through an fd returns an error code, and a write from the command prompt
> > shows the string corresponding to that error code.
> > 
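
The create step could be wrapped roughly as follows. This is only a sketch
under the syntax described above: the function names mdev_create_arg and
mdev_create are invented for illustration, and the physical device directory
is passed in as a parameter rather than hard-coded.

```shell
# Build the string that gets written to mdev_create (or mdev_destroy).
# $1 = mdev UUID, $2 = optional vendor specific argument list
mdev_create_arg() {
    if [ -n "${2:-}" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"      # bare "$mdev_UUID" when there are no vendor args
    fi
}

# Hypothetical wrapper around the write to mdev_create.
# $1 = physical device sysfs dir, $2 = mdev UUID, $3 = optional vendor args
mdev_create() {
    arg=$(mdev_create_arg "$2" "${3:-}")
    # A failed write surfaces as a non-zero exit status from the redirect.
    if ! printf '%s\n' "$arg" > "$1/mdev_create"; then
        echo "mdev_create failed for $2" >&2
        return 1
    fi
}
```

A caller would check the return status and, on success, expect the new device
directory to appear under /sys/bus/mdev/devices/.
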
> > 3. Start/stop mediated device
> > 
> > Under the virtual device sysfs, you will see a new "online" sysfs file.
> > 
> > You can do cat /sys/bus/mdev/devices/$mdev_UUID/online to get the current status
> > of this virtual device (0 or 1). To start or stop a virtual device, write 1 or
> > 0 respectively:
> > 
> >     echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
> > 
> > libvirt needs to query the current state before changing state.
> > 
> > Note: if you have multiple devices, you need to write to the "online" file
> > individually.
> > 
> > For NVIDIA, if there are multiple mdev per VM, libvirt needs to bring all of
> > them "online" before starting QEMU.
> 
> This is a valid requirement, indeed.

Thanks!
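
For what it is worth, the "query first, then write each device individually"
rule could look like the sketch below. The function name set_mdev_online is
hypothetical, and the device directories are passed in by the caller.

```shell
# Hypothetical helper: bring a list of mdev devices to the desired state.
# $1 = desired state (0 or 1); remaining args = mdev device sysfs directories
set_mdev_online() {
    want=$1
    shift
    for dev in "$@"; do
        # Query the current state before changing it.
        cur=$(cat "$dev/online") || return 1
        if [ "$cur" != "$want" ]; then
            # Each device has its own "online" file and is written individually.
            echo "$want" > "$dev/online" || return 1
        fi
    done
}
```

For the multiple-mdev-per-VM case, libvirt would call this with every device
of the VM and only start QEMU once it returns success.
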

> 
> > 
> > 4. Launch QEMU/VM
> > 
> > Pass the mdev sysfs path to QEMU as vfio-pci device:
> > 
> >     -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0
> 
> One question here. Libvirt allows users to run qemu under different
> user:group than root:root. If that's the case, libvirt sets security
> labels on all files qemu can/will touch. Are we going to need to do
> something in that respect here?

As long as QEMU uses the VFIO API and doesn't do anything extra for any particular
vendor, there shouldn't be any problem on the QEMU side. So I don't see any issues
here.

But I would like to test it out with the proper settings for the NVIDIA vGPU case.
Currently all our testing uses sysfs and launches QEMU directly. If I just
mimic how libvirt launches QEMU for a normal VFIO passthru device, will that
cover the SELinux label concerns?

Thanks,
Neo

> 
> > 
> > 5. Shutdown sequence 
> > 
> > libvirt needs to shut down QEMU, bring the virtual device offline, then destroy
> > the virtual device.
> > 
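
The ordering of the last two steps could be sketched like this, assuming the
QEMU process has already exited. The name mdev_teardown is made up for the
example, and both sysfs directories are supplied by the caller.

```shell
# Hypothetical helper; run only after the QEMU process has exited.
# $1 = physical device sysfs dir
# $2 = mdev device sysfs dir
# $3 = string for mdev_destroy ("$mdev_UUID" or "$mdev_UUID:vendor_args")
mdev_teardown() {
    # Step 1: bring the virtual device offline if it is still online.
    state=$(cat "$2/online") || return 1
    if [ "$state" != "0" ]; then
        echo 0 > "$2/online" || return 1
    fi
    # Step 2: destroy the virtual device through its physical parent.
    printf '%s\n' "$3" > "$1/mdev_destroy" || return 1
}
```

Returning early on the first failed write keeps the device from being
destroyed while it is still online.
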
> > 6. VM Reset
> > 
> > No change or requirement for libvirt as this will be handled via VFIO reset API
> > and QEMU process will keep running as before.
> > 
> > 7. Hot-plug
> > 
> > It is optional for vendors to support hot-plug.
> > 
> > The same syntax is used to create a virtual device for hot-plug.
> > 
> > For hot-unplug, after executing the QEMU monitor "device_del" command, libvirt
> > needs to write to the "mdev_destroy" sysfs file to complete the hot-unplug
> > process.
> > 
> > Since hot-plug is optional, mdev_create or mdev_destroy operations may
> > return an error if it is not supported.
> 
> Thank you for the very detailed description! In general, I like the API as
> it looks usable from my POV (I'm no VFIO devel though).
> 
> Michal