[libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

Kirti Wankhede kwankhede at nvidia.com
Sat Sep 3 16:31:13 UTC 2016



On 9/3/2016 1:59 AM, John Ferlan wrote:
> 
> 
> On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
>>
>> On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 02/09/2016 19:15, Kirti Wankhede wrote:
>>>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
>>>>>    <device>
>>>>>      <name>my-vgpu</name>
>>>>>      <parent>pci_0000_86_00_0</parent>
>>>>>      <capability type='mdev'>
>>>>>        <type id='11'/>
>>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>      </capability>
>>>>>    </device>
>>>>>
>>>>> After creating the vGPU, if required by the host driver, all the other
>>>>> type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
>>>>
>>>> Thanks, Paolo, for the details.
>>>> 'nodedev-create' parses the XML file and accordingly writes to the
>>>> 'create' file in sysfs to create the mdev device. Right?
>>>> At this moment, does libvirt know which VM this device would be
>>>> associated with?
>>>
>>> No, the VM will associate to the nodedev through the UUID.  The nodedev
>>> is created separately from the VM.
>>>
>>>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
>>>>> info, again taken from sysfs:
>>>>>
>>>>>    <device>
>>>>>      <name>my-vgpu</name>
>>>>>      <parent>pci_0000_86_00_0</parent>
>>>>>      <capability type='mdev'>
>>>>>        <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>        <!-- only the chosen type -->
>>>>>        <type id='11'>
>>>>>          <!-- ... snip ... -->
>>>>>        </type>
>>>>>        <capability type='pci'>
>>>>>          <!-- no domain/bus/slot/function of course -->
>>>>>          <!-- could show whatever PCI IDs are seen by the guest: -->
>>>>>          <product id='...'>...</product>
>>>>>          <vendor id='0x10de'>NVIDIA</vendor>
>>>>>        </capability>
>>>>>      </capability>
>>>>>    </device>
>>>>>
>>>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
>>>>> pci at all, would have it inside mdev.  This represents the difference
>>>>> between the mdev provider and the mdev device.
>>>>
>>>> The parent of an mdev device might not always be a PCI device. I
>>>> think we shouldn't represent it as a PCI capability.
>>>
>>> The <capability type='pci'> in the vGPU means that it _will_ be exposed
>>> as a PCI device by VFIO.
>>>
>>> The <capability type='pci'> in the physical GPU means that the GPU is a
>>> PCI device.
>>>
>>
>> Ok. Got that.
>>
>>>>> Random proposal for the domain XML too:
>>>>>
>>>>>   <hostdev mode='subsystem' type='pci'>
>>>>>     <source type='mdev'>
>>>>>       <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
>>>>>       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>     </source>
>>>>>     <address type='pci' bus='0' slot='2' function='0'/>
>>>>>   </hostdev>
>>>>>
>>>>
>>>> When a user wants to assign two mdev devices to one VM, do they
>>>> have to add two such entries, or group the two devices in one entry?
>>>
>>> Two entries, one per UUID, each with its own PCI address in the guest.
>>>
>>>> On the other mail thread with the same subject we are thinking of
>>>> creating a group of mdev devices to assign multiple mdev devices to
>>>> one VM.
>>>
>>> What is the advantage in managing mdev groups?  (Sorry didn't follow the
>>> other thread).
>>>
>>
>> When an mdev device is created, resources from the physical device
>> are assigned to this device. But the resources are committed only
>> when the device goes 'online' ('start' in the v6 patch).
>> In the case of multiple vGPUs in a VM for the NVIDIA vGPU solution,
>> resources for all vGPU devices in a VM are committed in one place. So
>> we need to know the vGPUs assigned to a VM before QEMU starts.
>>
>> Grouping would help here, as Alex suggested in that mail. Pulling
>> only that part of the discussion here:
>>
>> <Alex> It seems then that the grouping needs to affect the iommu group
>> so that
>>> you know that there's only a single owner for all the mdev devices
>>> within the group.  IIRC, the bus drivers don't have any visibility
>>> to opening and releasing of the group itself to trigger the
>>> online/offline, but they can track opening of the device file
>>> descriptors within the group.  Within the VFIO API the user cannot
>>> access the device without the device file descriptor, so a "first
>>> device opened" and "last device closed" trigger would provide the
>>> trigger points you need.  Some sort of new sysfs interface would need
>>> to be invented to allow this sort of manipulation.
>>> Also we should probably keep sight of whether we feel this is
>>> sufficiently necessary for the complexity.  If we can get by with only
>>> doing this grouping at creation time then we could define the "create"
>>> interface in various ways.  For example:
>>>
>>> echo $UUID0 > create
>>>
>>> would create a single mdev named $UUID0 in its own group.
>>>
>>> echo {$UUID0,$UUID1} > create
>>>
>>> could create mdev devices $UUID0 and $UUID1 grouped together.
>>>
>> </Alex>
>>
>> <Kirti>
>> I think this would create mdev devices of the same type on the same
>> parent device. We need to consider the case where multiple mdev
>> devices of different types and with different parents are grouped
>> together.
>> </Kirti>
>>
>> <Alex> We could even do:
>>>
>>> echo $UUID1:$GROUPA > create
>>>
>>> where $GROUPA is the group ID of a previously created mdev device into
>>> which $UUID1 is to be created and added to the same group.
>> </Alex>
>>
>> <Kirti>
>> I was thinking about:
>>
>>   echo $UUID0 > create
>>
>> would create an mdev device.
>>
>>   echo $UUID0 > /sys/class/mdev/create_group
>>
>> would add the created device to a group.
>>
>> For multiple devices case:
>>   echo $UUID0 > create
>>   echo $UUID1 > create
>>
>> would create mdev devices which could be of different types and have
>> different parents.
>>   echo $UUID0, $UUID1 > /sys/class/mdev/create_group
>>
>> would add the devices to a group.
>> The mdev core module would create a new group with a unique number.
>> On mdev device 'destroy', that mdev device would be removed from its
>> group. When there are no devices left in the group, the group would
>> be deleted. With this, the "first device opened" and "last device
>> closed" triggers can be used to commit resources.
>> Then libvirt would use the mdev device path to pass as an argument to
>> QEMU, the same as it does for VFIO. Libvirt doesn't have to care
>> about the group number.
>> </Kirti>
>>
> 
> The more complicated one makes this, the more difficult it is for the
> customer to configure, and the longer it takes to get something out. I
> didn't follow the details of groups...
> 
> What gets created from a pass through some *mdev/create_group?  

My proposal here is that
  echo $UUID1, $UUID2 > /sys/class/mdev/create_group
would create a group in the mdev core driver, which should be internal
to the mdev core module. In the mdev core module, a unique group number
would be saved in the mdev_device structure for each device belonging
to that group.
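
To make the proposal concrete, a rough sketch of the full lifecycle
could look like the following. Everything below is part of the proposal
being discussed, not an existing interface; the 'mdev_destroy' path is
assumed to mirror 'mdev_create', and the 'group_id' attribute is purely
hypothetical, shown only to illustrate that the group number stays
internal to the mdev core module:

  # create two mdev devices, possibly on different parent devices
  echo $UUID0 > /sys/../<bdf1>/mdev_create
  echo $UUID1 > /sys/../<bdf2>/mdev_create

  # group them; the mdev core assigns a unique internal group number
  echo $UUID0, $UUID1 > /sys/class/mdev/create_group

  # (hypothetical) both devices would then report the same group number
  cat /sys/bus/mdev/devices/$UUID0/group_id
  cat /sys/bus/mdev/devices/$UUID1/group_id

  # destroying a device removes it from its group; the group is deleted
  # automatically once it becomes empty
  echo $UUID0 > /sys/../<bdf1>/mdev_destroy

The vendor driver would then use "first device opened" / "last device
closed" within a group as the points to commit and release resources.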

> Does
> some new udev device get create that then is fed to the guest?

No, a group is not a device. It would just be an identifier that the
vendor driver can use to identify the devices belonging to a group.

> Seems
> painful to make two distinct/async passes through systemd/udev. I
> foresee testing nightmares with creating 3 vGPUs, processing a group
> request, while some other process/thread is deleting a vGPU... How do
> the vGPUs get marked so that the delete cannot happen?
> 

How is the same case handled for a directly assigned device? I mean, a
device is unbound from its vendor's driver and bound to the vfio_pci
driver. How is it guaranteed to stay bound to the vfio_pci module? Some
other process/thread might unbind it from the vfio_pci module.
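
For reference, binding a physical device to vfio-pci today is also just
a sequence of independent sysfs writes; nothing in the kernel ties the
steps together into one transaction. A sketch of one common way to do
it (using the example address of the parent device above):

  BDF=0000:86:00.0
  # unbind from the current (vendor) driver
  echo $BDF > /sys/bus/pci/devices/$BDF/driver/unbind
  # ask the PCI core to use vfio-pci for this device, then reprobe
  echo vfio-pci > /sys/bus/pci/devices/$BDF/driver_override
  echo $BDF > /sys/bus/pci/drivers_probe

Another process could race with any of these steps too; keeping that
consistent is left to the management layer, which is the point of the
question above.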

> If a vendor wants to create their own utility to group vHBAs together
> and manage that grouping, then have at it...  Doesn't seem to be
> something libvirt needs to be or should be managing...  As I go running
> for cover...
> 
> If multiple types are generated for a single vGPU, then consider the
> following XML:
> 
>    <capability type='mdev'>
>      <type id='11' [other attributes]/>
>      <type id='11' [other attributes]/>
>      <type id='12' [other attributes]/>
>      [<uuid>...</uuid>]
>     </capability>
> 
> then perhaps building the mdev_create input would be a comma-separated
> list of types to be added... "$UUID:11,11,12". Just a thought...
> 

In that case the vGPUs are created on the same physical GPU. Consider
the case where two vGPUs on different physical devices need to be
assigned to a VM. Then those should be two different create commands:

   echo $UUID0 > /sys/../<bdf1>/mdev_create
   echo $UUID1 > /sys/../<bdf2>/mdev_create
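
Under the grouping proposal above, those two devices could then still
be tied to a single VM with one more step, e.g.:

   echo $UUID0, $UUID1 > /sys/class/mdev/create_group

(create_group being the interface proposed above, not something that
exists today).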

Kirti.
> 
> John
> 
>> Thanks,
>> Kirti
>>



