[libvirt] [PATCH 0/3] sample: vfio mdev display devices.

Thu Apr 26 18:55:23 UTC 2018

* Kirti Wankhede (kwankhede at nvidia.com) wrote:
> 
> 
> On 4/26/2018 1:22 AM, Dr. David Alan Gilbert wrote:
> > * Alex Williamson (alex.williamson at redhat.com) wrote:
> >> On Wed, 25 Apr 2018 21:00:39 +0530
> >> Kirti Wankhede <kwankhede at nvidia.com> wrote:
> >>
> >>> On 4/25/2018 4:29 AM, Alex Williamson wrote:
> >>>> On Wed, 25 Apr 2018 01:20:08 +0530
> >>>> Kirti Wankhede <kwankhede at nvidia.com> wrote:
> >>>>   
> >>>>> On 4/24/2018 3:10 AM, Alex Williamson wrote:  
> >>>>>> On Wed, 18 Apr 2018 12:31:53 -0600
> >>>>>> Alex Williamson <alex.williamson at redhat.com> wrote:
> >>>>>>     
> >>>>>>> On Mon,  9 Apr 2018 12:35:10 +0200
> >>>>>>> Gerd Hoffmann <kraxel at redhat.com> wrote:
> >>>>>>>    
> >>>>>>>> This little series adds three drivers, for demo-ing and testing vfio
> >>>>>>>> display interface code.  There is one mdev device for each interface
> >>>>>>>> type (mdpy.ko for region and mbochs.ko for dmabuf).      
> >>>>>>>
> >>>>>>> Erik Skultety brought up a good question today regarding how libvirt is
> >>>>>>> meant to handle these different flavors of display interfaces and
> >>>>>>> knowing whether a given mdev device has display support at all.  It
> >>>>>>> seems that we cannot simply use the default display=auto because
> >>>>>>> libvirt needs to specifically configure gl support for a dmabuf type
> >>>>>>> interface versus not having such a requirement for a region interface,
> >>>>>>> perhaps even removing the emulated graphics in some cases (though I
> >>>>>>> don't think we have boot graphics through either solution yet).
> >>>>>>> Additionally, GVT-g seems to need the x-igd-opregion support
> >>>>>>> enabled(?), which is a non-starter for libvirt as it's an experimental
> >>>>>>> option!
> >>>>>>>
> >>>>>>> Currently the only way to determine display support is through the
> >>>>>>> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on
> >>>>>>> their own they'd need to get to the point where they could open the
> >>>>>>> vfio device and perform the ioctl.  That means opening a vfio
> >>>>>>> container, adding the group, setting the iommu type, and getting the
> >>>>>>> device.  I was initially a bit appalled at asking libvirt to do that,
> >>>>>>> but the alternative is to put this information in sysfs, but doing that
> >>>>>>> we risk that we need to describe every nuance of the mdev device
> >>>>>>> through sysfs and it becomes a dumping ground for every possible
> >>>>>>> feature an mdev device might have.
> >> ...    
> >>>>>>> So I was ready to return and suggest that maybe libvirt should probe
> >>>>>>> the device to know about these ancillary configuration details, but
> >>>>>>> then I remembered that both mdev vGPU vendors had external dependencies
> >>>>>>> to even allow probing the device.  KVMGT will fail to open the device
> >>>>>>> if it's not associated with an instance of KVM and NVIDIA vGPU, I
> >>>>>>> believe, will fail if the vGPU manager process cannot find the QEMU
> >>>>>>> instance to extract the VM UUID.  (Both of these were bad ideas)    
> >>>>>>
> >>>>>> Here's another proposal that's really growing on me:
> >>>>>>
> >>>>>>  * Fix the vendor drivers!  Allow devices to be opened and probed
> >>>>>>    without these external dependencies.
> >>>>>>  * Libvirt uses the existing vfio API to open the device and probe the
> >>>>>>    necessary ioctls, if it can't probe the device, the feature is
> >>>>>>    unavailable, ie. display=off, no migration.
> >>>>>>     
> >>>>>
> >>>>> I'm trying to think simpler mechanism using sysfs that could work for
> >>>>> any feature and knowing source-destination migration compatibility check
> >>>>> by libvirt before initiating migration.
> >>>>>
> >>>>> I have another proposal:
> >>>>> * Add a ioctl VFIO_DEVICE_PROBE_FEATURES
> >>>>> struct vfio_device_features {
> >>>>>     __u32 argsz;
> >>>>>     __u32 features;
> >>>>> }
> >>>>>
> >>>>> Define bit for each feature:
> >>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_REGION	(1 << 0)
> >>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF	(1 << 1)
> >>>>> #define VFIO_DEVICE_FEATURE_MIGRATION		(1 << 2)
> >>>>>
> >>>>> * Vendor driver returns bitmask of supported features during
> >>>>> initialization phase.
> >>>>>
> >>>>> * In vfio core module, trap this ioctl for each device  in
> >>>>> vfio_device_fops_unl_ioctl(),  
> >>>>
> >>>> Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our
> >>>> blocking point with mdev drivers, we can't get a device fd, so we can't
> >>>> call an ioctl on the device fd.
> >>>>   
> >>>
> >>> I'm sorry, I thought we could expose features when QEMU initialize, but
> >>> libvirt needs to know supported features before QEMU initialize.
> >>>
> >>>
> >>>>> check features bitmask returned by vendor
> >>>>> driver and add a sysfs file if feature is supported that device. This
> >>>>> sysfs file would return 0/1.  
> >>>>
> >>>> I don't understand why we have an ioctl interface, if the user can get
> >>>> to the device fd then we have existing interfaces to probe these
> >>>> things, it seems like you're just wanting to pass a features bitmap
> >>>> through to vfio_add_group_dev() that vfio-core would expose through
> >>>> sysfs, but a list of feature bits doesn't convey enough info except for
> >>>> the most basic uses.
> >>>>    
> >>>
> >>> Yes, vfio_add_group_dev() seems to be better way to convey features to
> >>> vfio core.
> >>>
> >>>>> For migration this bit will only indicate if host driver supports
> >>>>> migration feature.
> >>>>>
> >>>>> For source and destination compatibility check libvirt would need more
> >>>>> data/variables to check like,
> >>>>> * if same type of 'mdev_type' device create-able at destination,
> >>>>>    i.e. if ('mdev_type'->available_instances > 0)
> >>>>>
> >>>>> * if host_driver_version at source and destination are compatible.
> >>>>> Host driver from same release branch should be mostly compatible, but if
> >>>>> there are major changes in structures or APIs, host drivers from
> >>>>> different branches might not be compatible, for example, if source and
> >>>>> destination are from different branches and one of the structure had
> >>>>> changed, then data collected at source might not be compatible with
> >>>>> structures at destination and typecasting it to changed structures would
> >>>>> mess up migrated data during restoration.  
> >>>>
> >>>> Of course now you're asking that libvirt understand the release
> >>>> versioning scheme of every vendor driver and that it remain
> >>>> programatically consistent.  We can't even do this with in-kernel
> >>>> drivers.  And in the end, still the best we can do is guess.
> >>>>  
> >>>
> >>> Libvirt doesn't need to understand the version, libvirt need to do
> >>> strcmp version string from source and destination. If those are equal,
> >>> then libvirt would understand that they are compatible.
> >>
> >> Who's to say that the driver version and migration compatibility have
> >> any relation at all?  Some drivers might focus on designing their own
> >> migration interface that can maintain compatibility across versions
> >> (QEMU does this), some drivers may only allow identical version
> >> migration (which is going to frustrate upper level management tools and
> >> customers - RHEL goes to great extents to support cross version
> >> migration).  We cannot have a one size fits all here that driver version
> >> defines completely the migration compatibility.
> > 
> > I'll agree; I don't know enough about these devices, but to give you
> > some example of things I'd expect to work:
> >    a) User adds new machines to their data centre with larger/newer
> > version of the same vendors GPU; in some cases that should work
> > (depending on vendor details etc)
> >    b) The same thing but with identical hardware but a newer driver on
> > the destination.
> > 
> > Obviously there will be some cut offs that say some versions are
> > incompatible;  but for normal migration we jump through serious hoops
> > to make sure stuff works; customers will expect the same with some
> > VFIO devices.
> > 
> 
> How libvirt checks that cut off where some versions are incompatible?

We have versioned 'machine types' - so for example QEMU has
  pc-i440fx-2.11
  pc-i440fx-2.10

machine types; any version of qemu that supports machine type
pc-i440fx-2.10 should behave the same to it's emulated devices.
If we change the behaviour then we tie it to the new machine type;
so the behaviour of a device in pc-i440fx-2.11 might be a bit different.
Occasionally we'll kill off old machine types; (actually we should do it
more!) - but certainly when we do downstream versions we tie it to
machine types as well.

We also have some migration-capability flags, so some features can only
be used if both sides have that flag, and also Libvirt has some checking
of host CPU flags.

> >>>>> * if guest_driver_version is compatible with host driver at destination.
> >>>>> For mdev devices, guest driver communicates with host driver in some
> >>>>> form. If there are changes in structures/APIs of such communication,
> >>>>> guest driver at source might not be compatible with host driver at
> >>>>> destination.  
> >>>>
> >>>> And another guess plus now the guest driver is involved which libvirt
> >>>> has no visibility to.
> >>>>    
> >>>
> >>> Like above libvirt need to do strcmp.
> >>
> >> Insufficient, imo
> >>
> >>>>> 'available_instances' sysfs already exist, later two should be added by
> >>>>> vendor driver which libvirt can use for migration compatibility check.  
> >>>>
> >>>> As noted previously, display and migration are not necessarily
> >>>> mdev-only features, it's possible that vfio-pci or vfio-platform could
> >>>> also implement these, so the sysfs interface cannot be restricted to
> >>>> the mdev template and lifecycle interface.
> >>>>   
> >>>
> >>> I agree.
> >>> Feature bitmask passed to vfio core is not mdev specific. But here
> >>> 'available_instances' for migration compatibility check is mdev
> >>> specific. If mdev device is not create-able at destination, there is no
> >>> point in initiating migration by libvirt.
> >>
> >> 'available_instances' for migration compatibility check...?  We use
> >> available_instances to know whether we have the resources to create a
> >> given mdev type.  It's certainly a prerequisite to have a device of the
> >> identical type at the migration target and how we define what is an
> >> identical device for a directly assigned PCI device is yet another
> >> overly complicated rat hole.  But an identical device doesn't
> >> necessarily imply migration compatibility and I think that's the
> >> problem we're tackling.  We cannot assume based only on the device type
> >> that migration is compatible, that's basically saying we're never going
> >> to have any bugs or oversights or new features in the migration stream.
> > 
> > Those things certainly happen; state that we forgot to transfer, new
> > features enables on devices, devices configured in different ways.
> > 
> 
> How libvirt checks migration compatibility for other devices across QEMU
> versions where source support a device and destination running with
> older QEMU version doesn't support that device or that device doesn't
> exist in that system?

Libvirt inspects the qemu to get lists of devices and capabilities; I'll
leave it to the libvirt guys to add more detail if needed.

Dave

> Thanks,
> Kirti
> 
> >> Chatting with Laine, it may be worth a step back to include migration
> >> experts and people up the stack with more visibility to how openstack
> >> operates.  The issue here is that if vfio gains migration support then
> >> we have a portion of the migration stream that is not under the control
> >> of QEMU, we cannot necessarily tie it to a QEMU machine type and we
> >> cannot necessarily dictate how the vfio bus driver (vendor driver)
> >> handles versioning and compatibility.  My intent was to expose some
> >> sort of migration information through the vfio API so that upper level
> >> tools could determine source and target compatibility, but this in
> >> itself is I think something new that those tools need to agree how it
> >> might be done.  How would something like openstack want to handle not
> >> only finding a migration target with a compatible device, but also
> >> verifying if the device supports the migration format of the source
> >> device?
> >>
> >> Alternatively, should we do anything?  Is the problem too hard and we
> >> should let the driver return an error when it receives an incompatible
> >> migration stream, aborting the migration?
> > 
> > It's a bit nasty; if you've hit the 'evacuate host' button then what
> > happens when you've got some incompatible hosts.
> > 
> > Dave
> > 
> >>>> One more try... we have a vfio_group fd.  This is created by the bus
> >>>> drivers calling vfio_add_group_dev() and registers a struct device, a
> >>>> struct vfio_device_ops, and private data.  Typically we only wire the
> >>>> device_ops to the resulting file descriptor we get from
> >>>> VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl
> >>>> through the group fd?  The ioctl would need to take a string arg to
> >>>> match to a device name, plus an ioctl cmd and arg for the device_ops
> >>>> ioctl.  The group ioctl would need to filter cmds to known, benign
> >>>> queries.  We'd also need to verify that the allowed ioctls have no
> >>>> dependencies on setup done in device_ops.open().  
> >>>
> >>> So these ioctls would be called without devices open() call, doesn't
> >>> this seem to be against file operations standard?
> >>
> >> vfio_device_ops is modeled largely after file operations, but I don't
> >> think we're bound by that for the interaction between vfio-core and the
> >> vfio bus drivers.  We could make a separate callback for unprivileged
> >> ioctls, but that seems like more work per driver when we really want to
> >> maintain the identical API, we just want to provide a more limited
> >> interface and change the calling point.
> >>
> >> An issue I thought of for migration though is that this path wouldn't
> >> have access to the migration region and therefore if we place a header
> >> within that region containing the compatibility and versioning
> >> information, the user still couldn't access it.  This doesn't seem to
> >> be a blocker though as we could put that information within the region
> >> capability that defines the region as used for migration.  Possibly a
> >> device could have multiple migration regions with different formats
> >> for backwards compatibility, of course then we'd need a way to
> >> determine which to use and which combinations have been validated.
> >> Thanks,
> >>
> >> Alex
> > --
> > Dr. David Alan Gilbert / dgilbert at redhat.com / Manchester, UK
> > 
--
Dr. David Alan Gilbert / dgilbert at redhat.com / Manchester, UK