[libvirt] [PATCH v7 0/4] Add Mediated device support

Thu Sep 1 04:09:40 UTC 2016

> From: Alex Williamson [mailto:alex.williamson at redhat.com]
> Sent: Wednesday, August 31, 2016 11:49 PM
> 
> On Wed, 31 Aug 2016 15:04:13 +0800
> Jike Song <jike.song at intel.com> wrote:
> 
> > On 08/31/2016 02:12 PM, Tian, Kevin wrote:
> > >> From: Alex Williamson [mailto:alex.williamson at redhat.com]
> > >> Sent: Wednesday, August 31, 2016 12:17 AM
> > >>
> > >> Hi folks,
> > >>
> > >> At KVM Forum we had a BoF session primarily around the mediated device
> > >> sysfs interface.  I'd like to share what I think we agreed on and the
> > >> "problem areas" that still need some work so we can get the thoughts
> > >> and ideas from those who weren't able to attend.
> > >>
> > >> DanPB expressed some concern about the mdev_supported_types sysfs
> > >> interface, which exposes a flat csv file with fields like "type",
> > >> "number of instance", "vendor string", and then a bunch of type
> > >> specific fields like "framebuffer size", "resolution", "frame rate
> > >> limit", etc.  This is not entirely machine parsing friendly and sort of
> > >> abuses the sysfs concept of one value per file.  Example output taken
> > >> from Neo's libvirt RFC:
> > >>
> > >> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> > >> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer,
> > >> max_resolution
> > >> 11      ,"GRID M60-0B",      16,       2,      45,     512M,    2560x1600
> > >> 12      ,"GRID M60-0Q",      16,       2,      60,     512M,    2560x1600
> > >> 13      ,"GRID M60-1B",       8,       2,      45,    1024M,    2560x1600
> > >> 14      ,"GRID M60-1Q",       8,       2,      60,    1024M,    2560x1600
> > >> 15      ,"GRID M60-2B",       4,       2,      45,    2048M,    2560x1600
> > >> 16      ,"GRID M60-2Q",       4,       4,      60,    2048M,    2560x1600
> > >> 17      ,"GRID M60-4Q",       2,       4,      60,    4096M,    3840x2160
> > >> 18      ,"GRID M60-8Q",       1,       4,      60,    8192M,    3840x2160
> > >>
> > >> The create/destroy then looks like this:
> > >>
> > >> echo "$mdev_UUID:vendor_specific_argument_list" >
> > >> 	/sys/bus/pci/devices/.../mdev_create
> > >>
> > >> echo "$mdev_UUID:vendor_specific_argument_list" >
> > >> 	/sys/bus/pci/devices/.../mdev_destroy
> > >>
> > >> "vendor_specific_argument_list" is nebulous.
> > >>
> > >> So the idea to fix this is to explode this into a directory structure,
> > >> something like:
> > >>
> > >> ├── mdev_destroy
> > >> └── mdev_supported_types
> > >>     ├── 11
> > >>     │   ├── create
> > >>     │   ├── description
> > >>     │   └── max_instances
> > >>     ├── 12
> > >>     │   ├── create
> > >>     │   ├── description
> > >>     │   └── max_instances
> > >>     └── 13
> > >>         ├── create
> > >>         ├── description
> > >>         └── max_instances
> > >>
> > >> Note that I'm only exposing the minimal attributes here for simplicity,
> > >> the other attributes would be included in separate files and we would
> > >> require vendors to create standard attributes for common device classes.
> > >
> > > I like this idea. All standard attributes are reflected into this hierarchy.
> > > In the meantime, can we still allow optional vendor string in create
> > > interface? libvirt doesn't need to know the meaning, but allows upper
> > > layer to do some vendor specific tweak if necessary.
> > >
> >
> > Not sure whether this can done within MDEV framework (attrs provided by
> > vendor driver of course), or must be within the vendor driver.
> 
> The purpose of the sub-directories is that libvirt doesn't need to pass
> arbitrary, vendor strings to the create function, the attributes of the
> mdev device created are defined by the attributes in the sysfs
> directory where the create is done.  The user only provides a uuid for
> the device.  Arbitrary vendor parameters are a barrier, libvirt may not
> need to know the meaning, but would need to know when to apply them,
> which is just as bad.  Ultimately we want libvirt to be able to
> interact with sysfs without having an vendor specific knowledge.

Understand. Today Intel doesn't have such vendor specific parameter
requirement when creating a mdev instance (assuming type definition
is enough to cover our existing parameters).

Just think about future extensibility. Say if a new parameter (say
a QoS parameter like weight or cap) must be statically set before 
created mdev instance starts to work, due to device limitation, such
parameter needs to be exposed as a new attribute under the specific 
mdev instance, e.g.:
	/sys/bus/pci/devices/<sbdf>/mdev/weight

Then libvirt needs to make sure it's set before open() the instance.

If such flow is acceptable, it should remove necessity of vendor specific
parameter at the create, because any such requirement should be 
converted into sysfs node, if applicable to all vendors, then libvirt
can do asynchronous configurations before starting the instance.

> 
> > >>
> > >> For vGPUs like NVIDIA where we don't support multiple types
> > >> concurrently, this directory structure would update as mdev devices are
> > >> created, removing no longer available types.  I carried forward
> > >
> > > or keep the type with max_instances cleared to ZERO.
> > >
> >
> > +1 :)
> 
> Possible yes, but why would the vendor driver report types that the
> user cannot create?  It just seems like superfluous information (well,
> except for the use I discover below).

If we consider using available_instances as you suggested later, this way
is simpler since libvirt only needs to scan available types once, w/o need
to differentiate whether a specific vendor allows only one type or 
multiple types. :-)

> 
> > >> max_instances here, but perhaps we really want to copy SR-IOV and
> > >> report a max and current allocation.  Creation and deletion is
> > >
> > > right, cur/max_instances look reasonable.
> > >
> > >> simplified as we can simply "echo $UUID > create" per type.  I don't
> > >> understand why destroy had a parameter list, so here I imagine we can
> > >> simply do the same... in fact, I'd actually rather see a "remove" sysfs
> > >> entry under each mdev device, so we remove it at the device rather than
> > >> in some central location (any objections?).
> > >
> > > OK to me.
> >
> > IIUC, "destroy" has a parameter list is only because the previous
> > $VM_UUID + instnace implementation. It should be safe to move the "destroy"
> > file under mdev now.
> >
> > >> We discussed how this might look with Intel devices which do allow
> > >> mixed vGPU types concurrently.  We believe, but need confirmation, that
> > >> the vendor driver could still make a finite set of supported types,
> > >> perhaps with additional module options to the vendor driver to enable
> > >> more "exotic" types.  So for instance if IGD vGPUs are based on
> > >> power-of-2 portions of the framebuffer size, then the vendor driver
> > >> could list types with 32MB, 64MB, 128MB, etc in useful and popular
> > >> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.
> > >
> > > Yes, Intel can do such type of definition. One thing I'm not sure is
> > > about impact cross listed types, i.e. when creating a new instance
> > > under a given type, max_instances under other types would be
> > > dynamically decremented based on available resource. Would it be
> > > a problem for libvirt or upper level stack, since a natural interpretation
> > > of max_instances should be a static number?
> > >
> > > An alternative is to make max_instances configurable, so libvirt has
> > > chance to define a pool of available instances with different types
> > > before creating any instance. For example, initially IGD driver may
> > > report max_instances only for a minimal sharing granularity:
> > > 	128MB:
> > > 		max_instances (8)
> > > 	256MB:
> > > 		max_instances (0)
> > > 	512MB:
> > > 		max_instances (0)
> > >
> > > Then libvirt can configure more types as:
> > > 	128MB:
> > > 		max_instances (2)
> > > 	256MB:
> > > 		max_instances (1)
> > > 	512MB:
> > > 		max_instances (1)
> > >
> > > Starting from this point, max_instances would be static and then
> > > mdev instance can be created under each type. But I'm not
> > > sure whether such additional configuration role is reasonable to libvirt...
> 
> My expectation of your example, where I'm assuming you have 1G of total
> memory that can be divided between the mdev devices would be:
> 
>  128M: 8
>  256M: 4
>  512M: 2
> 
> If a 512M mdev device is created, this becomes:
> 
>  128M: 4
>  256M: 2
>  512M: 1
> 
> Creating a 128M mdev device from that becomes:
> 
>  128M: 3
>  256M: 1
>  512M: 0
> 
> It's not great, but I don't know how to do it better without the user
> having a clear understanding of the algorithm and resources required
> for each mdev device.  For instance, the size here, presumably the
> framebuffer size, is just one attribute in the device directory, the
> user won't know that this attribute is the key to the available
> instances.

Above is just one example. We may provide types described as:
"small", "medium" and "large", each with a description of available
resources, like framebuffer size, default weight, etc. But the
rationale is same, that creating instance under one type may impact
available instances under other types.

> 
> I don't particularly like the idea of a writeable max_instances, the
> user can simply create instances of the type and see the results.
> 
> Just thought of another thing; do we need some way to determine the
> type of an mdev device from sysfs or is this implicit knowledge for the
> user that created the device?  For instance, we create a 512M device
> and it becomes a child device to the parent, so we can associate to the
> parent, but if we come back later, how do we know it's a 512M device?
> Perhaps this is a reason to keep the type directories around and we can
> cross link the device to the type and create a devices subdirectory
> under each type.  

yes, we can have a hierarchy like below:

  	/sys/bus/pci/devices/<sbdf>/mdev/
  	|-- uuid1/
  	|   `-- type (->/sys/bus/pci/devices/<sbdf>/types/12)
  	`-- uuid2/
  	    `-- type (->/sys/bus/pci/devices/<sbdf>/types/13)

	/sys/bus/pci/devices/<sbdf>/types/12/
  	|-- create
  	|-- description
  	|-- available_instances
  	|-- devices
  	    `-- uuid1 (->/sys/bus/pci/devices/<sbdf>/mdev/uuid1)

	/sys/bus/pci/devices/<sbdf>/types/13/
  	|-- create
  	|-- description
  	|-- available_instances
  	|-- devices
  	    `-- uuid2 (->/sys/bus/pci/devices/<sbdf>/mdev/uuid2)

> Perhaps then "max_instances" becomes
> "available_instances" (ie. how many left we can create) and we don't
> need a "current_instances" because we can simply look in the devices
> directory.

It's a nice idea.

> 
> > >>
> > >> We still don't have any way for the admin to learn in advance how the
> > >> available supported types will change once mdev devices start to be
> > >> created.  I'm not sure how we can create a specification for this, so
> > >> probing by creating devices may be the most flexible model.
> > >>
> > >> The other issue is the start/stop requirement, which was revealed to
> > >> setup peer-to-peer resources between vGPUs which is a limited hardware
> > >> resource.  We'd really like to have these happen automatically on the
> > >> first open of a vfio mdev device file and final release.  So we
> > >> brainstormed how the open/release callbacks could know the other mdev
> > >> devices for a given user.  This is where the instance number came into
> > >> play previously.  This is an area that needs work.
> > >
> > > IGD doesn't have such peer-to-peer resource setup requirement. So
> > > it's sufficient to create/destroy a mdev instance in a single action on
> > > IGD. However I'd expect we still keep the "start/stop" interface (
> > > maybe not exposed as sysfs node, instead being a VFIO API), as
> > > required to support future live migration usage. We've made prototype
> > > working for KVMGT today.
> 
> Great!
> 
> > It's good for the framework to define start/stop interfaces, but as Alex
> > said below, it should be MDEV oriented, not VM oriented.
> >
> > I don't know a lot about the peer-to-peer resource, but to me, although
> > VM_UUID + instance is not applicable, userspace can always achieve the
> > same purpose by, let us assume a mdev hierarchy, providing the VM UUID
> > under every mdev:
> >
> > 	/sys/bus/pci/devices/<sbdf>/mdev/
> > 	|-- mdev01/
> > 	|   `-- vm_uuid
> > 	`-- mdev02/
> > 	    `-- vm_uuid
> >
> > Did I miss something?
> 
> Sure, this is just another way of doing UUID+instance.  Nit, it might
> look more like:
> 
>  	/sys/bus/pci/devices/<sbdf>/mdev/
>  	|-- uuid1/
>  	|   `-- group_uuid
>  	`-- uuid2/
>  	    `-- group_uuid
> 
> Where each mdev device is actually referenced by its UUID name then
> we'd have some writable attribute under the device where mdev devices
> sharing the same group UUID are handled together.  There's a problem
> here though that vfio doesn't know about this level of grouping, so
> uuid1 and uuid2 could actually be given to different users despite the
> grouping here, which results in one or both devices not working or
> creating security issues.  That sort of implies that this would
> necessarily need to be exposed as iommu grouping.  This factors into why
> it seems like a good idea to make the start/stop implicit within the
> interface.  In that way each mdev device is fungible as far as a user
> like libvirt is concerned, internal details like peer-to-peer resources
> are handled automatically as the devices are accessed.

Such group knowledge comes from user. I'm not sure whether IOMMU group
logic allows user to create/define group today. Is it better to just create a
mdev group concept within VFIO scope?

  	/sys/bus/pci/devices/<sbdf>/mdev/
  	|-- uuid1/
  	|   `-- group_uuid0
  	`-- uuid2/
  	    `-- group_uuid0

	/sys/bus/pci/devices/<sbdf>/mdev/groups/
  	|-- 0/
  	|   `-- uuid1
  	    `-- uuid2

User is expected to setup group before opening any mdev instance within
that group. This way it should be easy for VFIO to start all instances 
within same group upon the 1st open() in this group.

Thanks
Kevin