[libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

Tue Sep 6 21:28:58 UTC 2016

On Wed, 7 Sep 2016 01:05:11 +0530
Kirti Wankhede <kwankhede at nvidia.com> wrote:

> On 9/6/2016 11:10 PM, Alex Williamson wrote:
> > On Sat, 3 Sep 2016 22:04:56 +0530
> > Kirti Wankhede <kwankhede at nvidia.com> wrote:
> >   
> >> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:  
> >>>
> >>>
> >>> On 02/09/2016 20:33, Kirti Wankhede wrote:    
> >>>> <Alex> We could even do:    
> >>>>>>
> >>>>>> echo $UUID1:$GROUPA > create
> >>>>>>
> >>>>>> where $GROUPA is the group ID of a previously created mdev device into
> >>>>>> which $UUID1 is to be created and added to the same group.    
> >>>> </Alex>    
> >>>
> >>> From the point of view of libvirt, I think I prefer Alex's idea.
> >>> <group> could be an additional element in the nodedev-create XML:
> >>>
> >>>     <device>
> >>>       <name>my-vgpu</name>
> >>>       <parent>pci_0000_86_00_0</parent>
> >>>       <capability type='mdev'>
> >>>         <type id='11'/>
> >>>         <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> >>>         <group>group1</group>
> >>>       </capability>
> >>>     </device>
> >>>
> >>> (should group also be a UUID?)
> >>>     
> >>
> >> No, this should be a unique number in a system, similar to iommu_group.  
> > 
> > Sorry, just trying to catch up on this thread after a long weekend.
> > 
> > We're talking about iommu groups here, we're not creating any sort of
> > parallel grouping specific to mdev devices.  
> 
> I thought we were talking about group of mdev devices and not iommu
> group. IIRC, there were concerns about it (this would be similar to
> UUID+instance) and that would (ab)use iommu groups.

What constraints does a group, which is not an iommu group, place on the
usage of the mdev devices?  What happens if we put two mdev devices in
the same "mdev group" and then assign them to separate VMs/users?  I
believe that the answer is that this theoretical "mdev group" doesn't
actually impose any constraints on the devices within the group or how
they're used.

vfio knows about iommu groups and we consider an iommu group to be the
unit of ownership for userspace.  Therefore by placing multiple mdev
devices within the same iommu group we can be assured that there's only
one user for that group.  Furthermore, the specific case for this
association on NVIDIA is to couple the hardware peer-to-peer resources
for the individual mdev devices.  Therefore this particular grouping
does imply a lack of isolation between those mdev devices involved in
the group.

For mdev devices which are actually isolated from one another, where
they don't poke these p2p holes, placing them in the same iommu group
is definitely an abuse of the interface and is going to lead to
problems with a single iommu context.  But how does libvirt know that
one type of mdev device needs to be grouped while another type doesn't?

There's really not much that I like about using iommu groups in this
way, it's just that they seem to solve this particular problem of
enforcing how such a group can be used and imposing a second form of
grouping onto the vfio infrastructure seems much too complex.

> I'm thinking about your suggestion, but would also like to know your
> thought how sysfs interface would look like? Its still no clear to me.
> Or will it be better to have grouping at mdev layer?

In previous replies I had proposed that a group could be an additional
argument when we write the mdev UUID to the create entry in sysfs.
This is specifically why I listed only the UUID when creating the first
mdev device and UUID:group when creating the second.  The user would
need to go determine the group ID allocated for the first entry to
specify creating the second within that same group.

I have no love for this proposal, it's functional but not elegant and
again leaves libvirt lost in trying to determine which devices need to
be grouped together and which have no business being grouped together.

Let's think through this further and let me make a couple assumptions
to get started:

1) iommu groups are the way that we want to group NVIDIA vGPUs because:
  a) The peer-to-peer resources represent an isolation gap between
     mdev devices, iommu groups represent sets of isolated devices.
  b) The 1:1 mapping of an iommu group to a user matches the NVIDIA
     device model.
  c) iommu_group_for_each_dev() gives the vendor driver the
     functionality it needs to perform a first-open/last-close
     device walk for configuring these p2p resources.

2) iommu groups as used by mdev devices should contain the minimum
number of devices in order to provide the maximum iommu context
flexibility.

Do we agree on these?  The corollary is that NVIDIA is going to suffer
reduced iommu granularity exactly because of the requirement to setup
p2p resources between mdev devices within the same VM.  This has
implications when guest iommus are in play (viommu).

So by default we want an iommu group per mdev.  This works for all mdev
devices as far as we know, including NVIDIA with the constraint that we
only have a single NVIDIA device per VM.

What if we want multiple NVIDIA devices?  We either need to create the
additional devices with a property which will place them into the same
iommu group or allow the iommu groups to be manipulated dynamically.

The trouble I see with the former (creating a device into a group) is
that it becomes part of the "create" syntax, which is global for all
mdev devices.  It's the same functional, but non-elegant solution I
proposed previously.

What if we allow groups to be manipulated dynamically?  In this case I
envision an attribute under the mdev device with read/write access.
The existence of the attribute indicates to libvirt that this device
requires such handling and allows reading and setting the association.
To be clear, the attribute would only exist on mdev devices requiring
this handling.  I'm always a fan of naming things after what they do, so
rather than making this attribute reference an iommu group, I might
actually call it "peer_to_peer_resource_uuid".  So the process might
look something like this:

# create 2 mdev devices
echo $UUID0 > /sys/devices/mdev/<s:b:d.f>/types/1/create
echo $UUID1 > /sys/devices/mdev/<s:b:d.f>/types/1/create

# move $UUID1 to the same group as $UUID0
P2P_UUID=$(cat /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid)
echo $P2P_UUID > \
    /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid

Alternatively we could have used uuidgen to create a UUID then moved
both to the new UUID.

Within the mdev vendor driver this would walk through the mdev devices,
find the matching peer_to_peer_resource_uuid (generated randomly at
create time by default) and add the device to the iommu group for
devices sharing that p2p uuid.  When removed from the VM, libvirt could
simply echo the output of uuidgen to each to split them again.

So from a libvirt perspective, special handling would need to invoked
that when this p2p attribute is found, all devices for a given VM would
need to share the same p2p uuid.  libvirt would be free to use an
existing p2p uuid or generate a new one.  The vendor driver should
enforce a write failure if the device cannot be added to the p2p uuid
(for example devices within the p2p uuid are already opened).

Maybe this is similar to your proposal and even goes back to vm_uuid,
but under the covers the vendor driver needs to be manipulating iommu
grouping based on this parameter and there's no concept of an "mdev
group" in the base API (nor vm_uuid), this is an extension keyed by the
additional sysfs attribute.

Are we getting closer?  Thanks,

Alex