[libvirt] [PATCH 0/3] sample: vfio mdev display devices.

Thu Apr 26 18:45:01 UTC 2018

On 4/26/2018 1:22 AM, Dr. David Alan Gilbert wrote:
> * Alex Williamson (alex.williamson at redhat.com) wrote:
>> On Wed, 25 Apr 2018 21:00:39 +0530
>> Kirti Wankhede <kwankhede at nvidia.com> wrote:
>>
>>> On 4/25/2018 4:29 AM, Alex Williamson wrote:
>>>> On Wed, 25 Apr 2018 01:20:08 +0530
>>>> Kirti Wankhede <kwankhede at nvidia.com> wrote:
>>>>   
>>>>> On 4/24/2018 3:10 AM, Alex Williamson wrote:  
>>>>>> On Wed, 18 Apr 2018 12:31:53 -0600
>>>>>> Alex Williamson <alex.williamson at redhat.com> wrote:
>>>>>>     
>>>>>>> On Mon,  9 Apr 2018 12:35:10 +0200
>>>>>>> Gerd Hoffmann <kraxel at redhat.com> wrote:
>>>>>>>    
>>>>>>>> This little series adds three drivers, for demo-ing and testing vfio
>>>>>>>> display interface code.  There is one mdev device for each interface
>>>>>>>> type (mdpy.ko for region and mbochs.ko for dmabuf).      
>>>>>>>
>>>>>>> Erik Skultety brought up a good question today regarding how libvirt is
>>>>>>> meant to handle these different flavors of display interfaces and
>>>>>>> knowing whether a given mdev device has display support at all.  It
>>>>>>> seems that we cannot simply use the default display=auto because
>>>>>>> libvirt needs to specifically configure gl support for a dmabuf type
>>>>>>> interface versus not having such a requirement for a region interface,
>>>>>>> perhaps even removing the emulated graphics in some cases (though I
>>>>>>> don't think we have boot graphics through either solution yet).
>>>>>>> Additionally, GVT-g seems to need the x-igd-opregion support
>>>>>>> enabled(?), which is a non-starter for libvirt as it's an experimental
>>>>>>> option!
>>>>>>>
>>>>>>> Currently the only way to determine display support is through the
>>>>>>> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on
>>>>>>> their own they'd need to get to the point where they could open the
>>>>>>> vfio device and perform the ioctl.  That means opening a vfio
>>>>>>> container, adding the group, setting the iommu type, and getting the
>>>>>>> device.  I was initially a bit appalled at asking libvirt to do that,
>>>>>>> but the alternative is to put this information in sysfs, but doing that
>>>>>>> we risk that we need to describe every nuance of the mdev device
>>>>>>> through sysfs and it becomes a dumping ground for every possible
>>>>>>> feature an mdev device might have.
>> ...    
>>>>>>> So I was ready to return and suggest that maybe libvirt should probe
>>>>>>> the device to know about these ancillary configuration details, but
>>>>>>> then I remembered that both mdev vGPU vendors had external dependencies
>>>>>>> to even allow probing the device.  KVMGT will fail to open the device
>>>>>>> if it's not associated with an instance of KVM and NVIDIA vGPU, I
>>>>>>> believe, will fail if the vGPU manager process cannot find the QEMU
>>>>>>> instance to extract the VM UUID.  (Both of these were bad ideas)    
>>>>>>
>>>>>> Here's another proposal that's really growing on me:
>>>>>>
>>>>>>  * Fix the vendor drivers!  Allow devices to be opened and probed
>>>>>>    without these external dependencies.
>>>>>>  * Libvirt uses the existing vfio API to open the device and probe the
>>>>>>    necessary ioctls, if it can't probe the device, the feature is
>>>>>>    unavailable, ie. display=off, no migration.
>>>>>>     
>>>>>
>>>>> I'm trying to think simpler mechanism using sysfs that could work for
>>>>> any feature and knowing source-destination migration compatibility check
>>>>> by libvirt before initiating migration.
>>>>>
>>>>> I have another proposal:
>>>>> * Add a ioctl VFIO_DEVICE_PROBE_FEATURES
>>>>> struct vfio_device_features {
>>>>>     __u32 argsz;
>>>>>     __u32 features;
>>>>> }
>>>>>
>>>>> Define bit for each feature:
>>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_REGION	(1 << 0)
>>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF	(1 << 1)
>>>>> #define VFIO_DEVICE_FEATURE_MIGRATION		(1 << 2)
>>>>>
>>>>> * Vendor driver returns bitmask of supported features during
>>>>> initialization phase.
>>>>>
>>>>> * In vfio core module, trap this ioctl for each device  in
>>>>> vfio_device_fops_unl_ioctl(),  
>>>>
>>>> Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our
>>>> blocking point with mdev drivers, we can't get a device fd, so we can't
>>>> call an ioctl on the device fd.
>>>>   
>>>
>>> I'm sorry, I thought we could expose features when QEMU initialize, but
>>> libvirt needs to know supported features before QEMU initialize.
>>>
>>>
>>>>> check features bitmask returned by vendor
>>>>> driver and add a sysfs file if feature is supported that device. This
>>>>> sysfs file would return 0/1.  
>>>>
>>>> I don't understand why we have an ioctl interface, if the user can get
>>>> to the device fd then we have existing interfaces to probe these
>>>> things, it seems like you're just wanting to pass a features bitmap
>>>> through to vfio_add_group_dev() that vfio-core would expose through
>>>> sysfs, but a list of feature bits doesn't convey enough info except for
>>>> the most basic uses.
>>>>    
>>>
>>> Yes, vfio_add_group_dev() seems to be better way to convey features to
>>> vfio core.
>>>
>>>>> For migration this bit will only indicate if host driver supports
>>>>> migration feature.
>>>>>
>>>>> For source and destination compatibility check libvirt would need more
>>>>> data/variables to check like,
>>>>> * if same type of 'mdev_type' device create-able at destination,
>>>>>    i.e. if ('mdev_type'->available_instances > 0)
>>>>>
>>>>> * if host_driver_version at source and destination are compatible.
>>>>> Host driver from same release branch should be mostly compatible, but if
>>>>> there are major changes in structures or APIs, host drivers from
>>>>> different branches might not be compatible, for example, if source and
>>>>> destination are from different branches and one of the structure had
>>>>> changed, then data collected at source might not be compatible with
>>>>> structures at destination and typecasting it to changed structures would
>>>>> mess up migrated data during restoration.  
>>>>
>>>> Of course now you're asking that libvirt understand the release
>>>> versioning scheme of every vendor driver and that it remain
>>>> programatically consistent.  We can't even do this with in-kernel
>>>> drivers.  And in the end, still the best we can do is guess.
>>>>  
>>>
>>> Libvirt doesn't need to understand the version, libvirt need to do
>>> strcmp version string from source and destination. If those are equal,
>>> then libvirt would understand that they are compatible.
>>
>> Who's to say that the driver version and migration compatibility have
>> any relation at all?  Some drivers might focus on designing their own
>> migration interface that can maintain compatibility across versions
>> (QEMU does this), some drivers may only allow identical version
>> migration (which is going to frustrate upper level management tools and
>> customers - RHEL goes to great extents to support cross version
>> migration).  We cannot have a one size fits all here that driver version
>> defines completely the migration compatibility.
> 
> I'll agree; I don't know enough about these devices, but to give you
> some example of things I'd expect to work:
>    a) User adds new machines to their data centre with larger/newer
> version of the same vendors GPU; in some cases that should work
> (depending on vendor details etc)
>    b) The same thing but with identical hardware but a newer driver on
> the destination.
> 
> Obviously there will be some cut offs that say some versions are
> incompatible;  but for normal migration we jump through serious hoops
> to make sure stuff works; customers will expect the same with some
> VFIO devices.
> 

How libvirt checks that cut off where some versions are incompatible?

>>>>> * if guest_driver_version is compatible with host driver at destination.
>>>>> For mdev devices, guest driver communicates with host driver in some
>>>>> form. If there are changes in structures/APIs of such communication,
>>>>> guest driver at source might not be compatible with host driver at
>>>>> destination.  
>>>>
>>>> And another guess plus now the guest driver is involved which libvirt
>>>> has no visibility to.
>>>>    
>>>
>>> Like above libvirt need to do strcmp.
>>
>> Insufficient, imo
>>
>>>>> 'available_instances' sysfs already exist, later two should be added by
>>>>> vendor driver which libvirt can use for migration compatibility check.  
>>>>
>>>> As noted previously, display and migration are not necessarily
>>>> mdev-only features, it's possible that vfio-pci or vfio-platform could
>>>> also implement these, so the sysfs interface cannot be restricted to
>>>> the mdev template and lifecycle interface.
>>>>   
>>>
>>> I agree.
>>> Feature bitmask passed to vfio core is not mdev specific. But here
>>> 'available_instances' for migration compatibility check is mdev
>>> specific. If mdev device is not create-able at destination, there is no
>>> point in initiating migration by libvirt.
>>
>> 'available_instances' for migration compatibility check...?  We use
>> available_instances to know whether we have the resources to create a
>> given mdev type.  It's certainly a prerequisite to have a device of the
>> identical type at the migration target and how we define what is an
>> identical device for a directly assigned PCI device is yet another
>> overly complicated rat hole.  But an identical device doesn't
>> necessarily imply migration compatibility and I think that's the
>> problem we're tackling.  We cannot assume based only on the device type
>> that migration is compatible, that's basically saying we're never going
>> to have any bugs or oversights or new features in the migration stream.
> 
> Those things certainly happen; state that we forgot to transfer, new
> features enables on devices, devices configured in different ways.
> 

How libvirt checks migration compatibility for other devices across QEMU
versions where source support a device and destination running with
older QEMU version doesn't support that device or that device doesn't
exist in that system?

Thanks,
Kirti

>> Chatting with Laine, it may be worth a step back to include migration
>> experts and people up the stack with more visibility to how openstack
>> operates.  The issue here is that if vfio gains migration support then
>> we have a portion of the migration stream that is not under the control
>> of QEMU, we cannot necessarily tie it to a QEMU machine type and we
>> cannot necessarily dictate how the vfio bus driver (vendor driver)
>> handles versioning and compatibility.  My intent was to expose some
>> sort of migration information through the vfio API so that upper level
>> tools could determine source and target compatibility, but this in
>> itself is I think something new that those tools need to agree how it
>> might be done.  How would something like openstack want to handle not
>> only finding a migration target with a compatible device, but also
>> verifying if the device supports the migration format of the source
>> device?
>>
>> Alternatively, should we do anything?  Is the problem too hard and we
>> should let the driver return an error when it receives an incompatible
>> migration stream, aborting the migration?
> 
> It's a bit nasty; if you've hit the 'evacuate host' button then what
> happens when you've got some incompatible hosts.
> 
> Dave
> 
>>>> One more try... we have a vfio_group fd.  This is created by the bus
>>>> drivers calling vfio_add_group_dev() and registers a struct device, a
>>>> struct vfio_device_ops, and private data.  Typically we only wire the
>>>> device_ops to the resulting file descriptor we get from
>>>> VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl
>>>> through the group fd?  The ioctl would need to take a string arg to
>>>> match to a device name, plus an ioctl cmd and arg for the device_ops
>>>> ioctl.  The group ioctl would need to filter cmds to known, benign
>>>> queries.  We'd also need to verify that the allowed ioctls have no
>>>> dependencies on setup done in device_ops.open().  
>>>
>>> So these ioctls would be called without devices open() call, doesn't
>>> this seem to be against file operations standard?
>>
>> vfio_device_ops is modeled largely after file operations, but I don't
>> think we're bound by that for the interaction between vfio-core and the
>> vfio bus drivers.  We could make a separate callback for unprivileged
>> ioctls, but that seems like more work per driver when we really want to
>> maintain the identical API, we just want to provide a more limited
>> interface and change the calling point.
>>
>> An issue I thought of for migration though is that this path wouldn't
>> have access to the migration region and therefore if we place a header
>> within that region containing the compatibility and versioning
>> information, the user still couldn't access it.  This doesn't seem to
>> be a blocker though as we could put that information within the region
>> capability that defines the region as used for migration.  Possibly a
>> device could have multiple migration regions with different formats
>> for backwards compatibility, of course then we'd need a way to
>> determine which to use and which combinations have been validated.
>> Thanks,
>>
>> Alex
> --
> Dr. David Alan Gilbert / dgilbert at redhat.com / Manchester, UK
>