[libvirt] [RFC v2] libvirt vGPU QEMU integration

Alex Williamson alex.williamson at redhat.com
Tue Sep 20 14:43:23 UTC 2016


On Tue, 20 Sep 2016 19:51:58 +0530
Kirti Wankhede <kwankhede at nvidia.com> wrote:

> On 9/20/2016 3:06 AM, Alex Williamson wrote:
> > On Tue, 20 Sep 2016 02:05:52 +0530
> > Kirti Wankhede <kwankhede at nvidia.com> wrote:
> >   
> >> Hi libvirt experts,
> >>
> >> Thanks for valuable input on v1 version of RFC.
> >>
> >> Quick brief, VFIO based mediated device framework provides a way to
> >> virtualize their devices without SR-IOV, like NVIDIA vGPU, Intel KVMGT
> >> and IBM's channel IO. This framework reuses VFIO APIs for all the
> >> functionalities for mediated devices which are currently being used for
> >> pass through devices. This framework introduces a set of new sysfs files
> >> for device creation and its life cycle management.
> >>
> >> Here is the summary of discussion on v1:
> >> 1. Discover mediated device:
> >> As part of physical device initialization process, vendor driver will
> >> register their physical devices, which will be used to create virtual
> >> device (mediated device, aka mdev) to the mediated framework.
> >>
> >> Vendor driver should specify mdev_supported_types in directory format.
> >> This format is class based, for example, display class directory format
> >> should be as below. We need to define such set for each class of devices
> >> which would be supported by mediated device framework.
> >>
> >>  --- mdev_destroy  
> > 
> > I thought we replaced mdev_destroy with a remove attribute for each
> > mdev device.
> >   
> >>  --- mdev_supported_types
> >>      |-- 11
> >>      |   |-- create
> >>      |   |-- name
> >>      |   |-- fb_length
> >>      |   |-- resolution
> >>      |   |-- heads
> >>      |   |-- max_instances
> >>      |   |-- params
> >>      |   |-- requires_group
> >>      |-- 12
> >>      |   |-- create
> >>      |   |-- name
> >>      |   |-- fb_length
> >>      |   |-- resolution
> >>      |   |-- heads
> >>      |   |-- max_instances
> >>      |   |-- params
> >>      |   |-- requires_group
> >>      |-- 13
> >>          |-- create
> >>          |-- name
> >>          |-- fb_length
> >>          |-- resolution
> >>          |-- heads
> >>          |-- max_instances
> >>          |-- params
> >>          |-- requires_group
> >>
> >>
> >> In the above example directory '11' represents a type id of mdev device.
> >> 'name', 'fb_length', 'resolution', 'heads', 'max_instance' and
> >> 'requires_group' would be Read-Only files that vendor would provide to
> >> describe about that type.
> >>
> >> 'create':
> >>     Write-only file. Mandatory.
> >>     Accepts string to create mediated device.
> >>
> >> 'name':
> >>     Read-Only file. Mandatory.
> >>     Returns string, the name of that type id.
> >>
> >> 'fb_length':
> >>     Read-only file. Mandatory.
> >>     Returns <number>{K,M,G}, size of framebuffer.  
> > 
> > This can't be mandatory, it's only relevant to vGPU devices, vGPUs are
> > just one user of mediated devices.
> >   
> 
> As per our discussion in BOF, we would define class and its attributes.
> As I mentioned above these are attributes of "display" class.

Just as I didn't know here, how does libvirt know the class of a given
type id?
 
> >>
> >> 'resolution':
> >>     Read-Only file. Mandatory.
> >>     Returns 'hres x vres' format. Maximum supported resolution.  
> > 
> > Same.
> >   
> >>
> >> 'heads':
> >>     Read-Only file. Mandatory.
> >>     Returns integer. Number of maximum heads supported.  
> > 
> > Same.
> >   
> >>
> >> 'max_instance':
> >>     Read-Only file. Mandatory.
> >>     Returns integer.  Returns maximum mdev device could be created
> >> at the moment when this file is read. This count would be updated by
> >> vendor driver. Before creating mdev device of this type, check if
> >> max_instance is > 0.  
> > 
> > Didn't we discuss this being being something like "available_instances"
> > with a "devices" sub-directory linking current devices of that type?
> >  
> 
> I'm ok with 'available_instances' as name of this file.
> Yes, mdev device directory will have links to devices of that type. I
> think that is part of mdev core module discussion. I'm trying to list
> sysfs files for libvirt interface here to focus more on libvirt
> interface. Hence didn't add that. I'll make sure to add all such details
> in future.
> 
> 
> >> 'params'
> >>     Write-Only file. Optional.
> >>     String input. Libvirt would pass the string given in XML file to
> >> this file and then create mdev device. Set empty string to clear params.
> >> For example, set parameter 'frame_rate_limiter=0' to disable frame rate
> >> limiter for performance benchmarking, then create device of type 11. The
> >> device created would have that parameter set by vendor driver.  
> > 
> > Might as well just call it "magic", there's absolutely no ability to
> > introspect what parameters are allowed or even valid here.  
> 
> Libvirt don't need to introspect these parameters. These are meant for
> vendor driver. Let vendor driver interpret these parameters and take
> action based on that.

Doesn't matter, there's clearly as strong feeling that libvirt wants
nothing to do with passing opaque vendor defined strings to a device.
Paolo suggested these could be attributed on the mdev device itself
that could be set by some 3rd party.  I suggested that these could be
module options on the vendor driver whether to expose types with these
properties.  There are other, more reasonable ways to do this.
 
> >  Can all parameters be changed dynamically?  
> 
> Parameters here would be extra parameters which are not listed above in
> mandatory list. If such parameters are required to set, these parameters
> should be set per mdev device before VM is booted.
> 
> >  Do parameter changes apply to existing devices or only future devices?  
> 
> No, it should be applied at the time when mdev device is created or
> after mdev device is created but before VM is booted. It will not be
> applicable to existing devices.
> 
> >  This is a poorly specified
> > interface.  I'd prefer this be done via module options on the vendor
> > driver.
> >   
> 
> Module option is not suitable here, in that case such parameters would
> be applied to all mdev device created for vendor driver and that is not
> what we want to.

Then use attributes on the mdev device itself, this is not an
acceptable interface.

> >> 'requires_group'
> >>     Read-Only file. Optional.
> >>     This should be provided by vendor driver if vendor driver need to
> >> group mdev devices in one domain so that vendor driver can use 'first
> >> open' to commit resources of all mdev devices associated to that domain
> >> and 'last close' to free those.
> >>
> >> The parent device would look like:
> >>
> >>    <device>
> >>      <name>pci_0000_86_00_0</name>
> >>      <capability type='pci'>
> >>        <domain>0</domain>
> >>        <bus>134</bus>
> >>        <slot>0</slot>
> >>        <function>0</function>  
> > 
> > This is the parent device address?  (I think we'd re-use the
> > specification for assigned devices)  Is this optional?  Couldn't
> > libvirt choose to pick any parent device supplying the specified type
> > if not supplied?
> >   
> >>        <capability type='mdev'>
> >>          <!-- one type element per sysfs directory -->
> >>          <type id='11'>
> >>            <!-- one element per sysfs file roughly -->
> >>            <name>GRID M60-0B</name>
> >>            <attribute name='fb_length'>512M</attribute>
> >>            <attribute name='resolution'>2560x1600</attribute>
> >>            <attribute name='heads'>2</attribute>
> >>            <attribute name='max_instances'>16</attribute>
> >>            <attribute name='requires_group'>1</attribute>
> >>          </type>
> >>        </capability>
> >>        <product id='...'>GRID M60</product>
> >>        <vendor id='0x10de'>NVIDIA</vendor>  
> > 
> > What are these specifying?
> >   
> >>      </capability>
> >>    </device>
> >>
> >> 2. Create/destroy mediated device
> >>
> >> With above example, vGPU device XML would look like:
> >>
> >>    <device>
> >>      <name>my-vgpu</name>
> >>      <parent>pci_0000_86_00_0</parent>
> >>      <capability type='mdev'>
> >>        <type id='11'/>
> >>        <group>1</group>
> >>        <params>'frame_rate_limiter=0'</params>
> >>      </capability>
> >>    </device>
> >>
> >> 'type id' is mandatory.  
> > 
> > I was under the impression that the vendor supplied "name" was the one
> > unique identifier.  We're certainly not going to create a registrar to
> > hand out type ids to each vendor, so what makes type id unique?  
> 
> Type id is unique, 'name' is the name (string) of device that would be
> displayed in guest OS for that type of mdev device.

We were told at the BoF that name was the unique string which would be
consistent and you again indicate below that "GRID-M60-0B" has a fixed
set of attributes, so why do we care what type id is associated with
that name?
 
> >  I have
> > a hard time believing that a given vendor can even allocate unique type
> > ids for their own devices.  Unique type id across vendors is not
> > practical.  So which attribute are we actually using to identify which
> > type of mdev device we need and why would we ever specify additional
> > attributes like fb_length?  Doesn't the vendor guarantee that "GRID
> > M60-0B" has a fixed setup of those attributes?
> >   
> 
> Specifying attributes here is not our requirement. Yes we have fixed set
> of attributes for "GRID-M60-0B" and on.
> We are defining the attributed here for "display" class for all other
> vendor of gpu can use.
> 
> 
> >> 'group' is optional. It should be a unique number in the system among
> >> all the groups created for mdev devices. Its usage is:
> >>   - not needed if single vGPU device is being assigned to a domain.
> >>   - only need to be set if multiple vGPUs need to be assigned to a
> >> domain and vendor driver have 'requires_group' file in type id directory.
> >>   - if type id directory include 'requires_group' and user tries to
> >> assign multiple vGPUs to a domain without having <group> field in XML,
> >> it will create single vGPU.  
> > 
> > We never finished our discussion of how this gets implemented or
> > whether a group applies only to devices from the same vendor, same
> > device, how heterogeneous groups are handled, etc.
> >   
> 
> We agreed on above points that I mentioned here and we wish to know what
> libvirt folks think, right?
> Since there were no inputs on that thread I thought I should summarize
> what had been discussed and get libvirt experts thoughts on this.

No, that was just an idea that was never full defined.  We had that
idea, the IOMMU group idea, the invalid container idea...  This is
still an open question in the design.

> >> 'params' is optional field. User should set this field if extra
> >> parameters need to be set for a particular vGPU device. Libvirt don't
> >> need to parse these params. These are meant for vendor driver.  
> > 
> > ie. magic black hole.  nope.
> >   
> >> Libvirt need to follow the sequence to create device:
> >> * Read /sys/../0000\:86\:00.0/11/max_instances. If it is greater than 0,
> >> then only proceed else fail.
> >>
> >> * Set extra params if 'params' field exist in device XML and 'params'
> >> file exist in type id directory
> >>
> >>     echo "frame_rate_limiter=0" > /sys/../0000\:86\:00.0/11/params
> >>
> >> * Autogenerate UUID
> >> * Create device:
> >>
> >>     echo "$UUID:<group>" > /sys/../0000\:86\:00.0/11/create
> >>
> >>     where <group> is optional. Group should be unique number among all
> >> the groups created for mdev devices.  
> > 
> > Is it an integer, a UUID, a string?  How is the group for an mdev
> > device discovered?  Can it be changed later?
> >   
> 
> Group number could be integer, that's where we need input from libvirt
> folks. It should be provided at the time of mdev device creation. It
> can't be changed after that. If user want to change group number
> destroy/remove mdev device and create it again with proper group number.

My concern is that a type id seems arbitrary but we're specifying that
it be unique.  We already have something unique, the name.  So why try
to make the type id unique as well?  A vendor can accidentally create
their vendor driver so that a given name means something very
specific.  On the other hand they need to be extremely deliberate to
coordinate that a type id means a unique thing across all their product
lines.

> >> * Clear params, if set earlier:
> >>
> >>     echo "" > /sys/../0000\:86\:00.0/11/params  
> > 
> > So params only applies to the devices created while those params are
> > set?  This is inherently racy.
> >   
> >> * To destroy device:
> >>
> >>     echo $UUID > /sys/../0000\:86\:00.0/mdev_destroy  
> > 
> > echo 1 > /sys/path/to/mdev/device/remove
> >    
> >> 3. Start/stop mediated device
> >>
> >> No change or requirement for libvirt as this will be handled by open()
> >> and close() callbacks to vendor driver. In case of multiple devices and
> >> 'requires_group' set, this will be handled in 'first open()' and 'last
> >> close()' on device in that group.
> >>
> >> 4. Launch QEMU/VM
> >>
> >>  Pass the mdev sysfs path to QEMU as vfio-pci device.
> >>  For above vGPU device example:
> >>
> >>     -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID
> >>
> >> 5. QEMU/VM Shutdown sequence
> >>
> >> No change or requirement for libvirt.
> >>
> >> 6. VM Reset
> >>
> >> No change or requirement for libvirt as this will be handled via VFIO
> >> reset API and QEMU process will keep running as before.
> >>
> >> 7. Hot-plug
> >>
> >> It is same syntax to create a virtual device for hot-plug.  
> > 
> > How do groups work with hotplug?  Can a device be creating into an
> > existing, running group?  Can a device be removed from an existing,
> > running group?  Thanks,
> >   
> 
> Group is for vendor driver to decide when to commit resources for a
> group of devices to support multiple devices in a domain. It will be
> upto vendor driver how it manage it.

Then it's not sufficiently specified for libvirt to use.  Thanks,

Alex




More information about the libvir-list mailing list