[libvirt] RFC: Creating mediated devices with libvirt

Thu Jun 22 16:38:08 UTC 2017

On 06/22/2017 12:15 PM, Daniel P. Berrange wrote:
> On Thu, Jun 22, 2017 at 05:14:48PM +0200, Erik Skultety wrote:
>> [...]
>>>>
>>>> ^this is the thing we constantly keep discussing as everyone has a slightly
>>>> different angle of view - libvirt does not implement any kind of policy,
>>>> therefore the only "configuration" would be the PCI parent placement - you say
>>>> what to do and we do it, no logic in it, that's it. Now, I don't see taking
>>>> care of the guesswork for the user in the simplest manner possible as policy,
>>>> but rather as a mere convenience, be it just for developers and testers, but
>>>> even that might apparently be perceived as a policy and therefore unacceptable.
>>>>
>>>> I still stand by the idea of having auto-creation as, unfortunately, I still
>>>> fail to understand what the negative implications of having it are. Is it that
>>>> it would get unnecessarily complex to maintain in the future and we would
>>>> regret it, that we'd get a huge number of follow-up requests for extending the
>>>> feature, or simply that auto-create is interpreted as policy?
>>>
>>> The increasing complexity of the qemu driver is a significant concern with
>>> adding policy-based logic to the code. Thinking about this though, if we
>>> provide the inactive node device feature, then we can avoid essentially
>>> all new code and complexity in the QEMU driver, and still support auto-create.
>>>
>>> ie, in the domain XML we just continue to have the exact same XML that
>>> we already have today for mdevs, but with a single new attribute
>>> autocreate=yes|no
>>>
>>>   <devices>
>>>     <hostdev mode='subsystem' type='mdev' model='vfio-pci' autocreate="yes">
>>>     <source>
>>>       <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
>>
>> So, just for clarification of the concept, the device with ^this UUID will have
>> had to be defined by the nodedev API by the time we start to edit the domain
>> XML in this manner in which case the only thing the autocreate=yes would do is
>> to actually create the mdev according to the nodedev config, right? Continuing
>> with that thought, if UUID doesn't refer to any of the inactive configs it will
>> be an error I suppose? What about the fact that only one vGPU type can live on
>> the GPU? Even if you can successfully identify a device using the UUID in this
>> way, you'll still face the problem that other types might be currently
>> occupying the GPU and need to be torn down first. Will this be automated as
>> well in what you suggest? I assume not.
> 
> Technically we shouldn't need the node device to exist at the time we
> define the XML - the node device only has to exist at the time we start
> the guest. E.g. it's the same way you list a virtual network as the
> source of a guest NIC, but that virtual network doesn't have to actually
> have been defined & started until the guest starts.
> 
> If there are constraints that a pGPU can only support a certain combination
> of vGPUs at any single point in time, doesn't the kernel already enforce
> that when you try to create the vGPU in sysfs? IOW, we merely need to try
> to create the vGPU, and if the kernel mdev driver doesn't allow you to mix
> that with the other vGPUs that already exist, then we'd just report an
> error from virNodeDeviceCreate, and that'd get propagated back as the
> error for the virDomainCreate call.
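
To make that error path concrete, here is a minimal sketch in plain C. It is a toy model, not libvirt code: struct mock_pgpu, mock_nodedev_create() and mock_domain_start() are all hypothetical names, and the "one vGPU type per pGPU" rule is only an assumption borrowed from Erik's question, standing in for whatever the kernel mdev driver really enforces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a physical GPU. Assumption: it can host vGPUs
 * of only one type at a time. */
struct mock_pgpu {
    const char *active_type;   /* NULL while no vGPU exists */
    int active_count;          /* number of vGPUs currently created */
};

/* Hypothetical stand-in for creating an mdev via sysfs.
 * Returns 0 on success, -1 if the mock "kernel" refuses the mix. */
static int mock_nodedev_create(struct mock_pgpu *gpu, const char *type)
{
    assert(gpu != NULL && type != NULL);
    if (gpu->active_count > 0 && strcmp(gpu->active_type, type) != 0)
        return -1;
    gpu->active_type = type;
    gpu->active_count++;
    return 0;
}

/* Hypothetical domain start. The node device is only needed here, not
 * when the domain XML is defined. */
static int mock_domain_start(struct mock_pgpu *gpu, const char *type,
                             int autocreate)
{
    if (autocreate && mock_nodedev_create(gpu, type) < 0)
        return -1;   /* the create failure becomes the start failure */
    return 0;
}
```

In this model, starting a second guest that wants a different vGPU type simply fails at start time, and the caller sees the creation error; no extra policy lives in the start path.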
> 
>>
>>>     </source>
>>>     </hostdev>
>>>   </devices>
>>>
>>> In the QEMU driver, then the only change required is
>>>
>>>    if (def->autocreate)
>>>        virNodeDeviceCreate(dev);
>>
>> Aha, so if a device gets torn down on shutdown, we won't face the problem with
>> some other devices being active, all of them will have to be in the inactive
>> state because they got torn down during the last shutdown - that would work.
> 
> I'm not sure how the relationship with other active devices is relevant
> here. The virNodeDevicePtr we're accessing here is a single vGPU - if other
> running guests have further vGPUs on the same pGPU, that's not really
> relevant. Each vGPU is created/deleted as required.

I think he's talking about devices that were previously used by other
domains that are no longer active. Since they're also automatically
destroyed, they're not a problem.
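
A rough sketch of that lifecycle, again as a toy model in plain C rather than anything from libvirt. The struct and function names are hypothetical, and the single-type-per-pGPU restriction is assumed purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pGPU model. Assumption: one vGPU type at a time. */
struct mock_pgpu {
    const char *active_type;   /* NULL while no vGPU exists */
    int active_count;
};

/* Hypothetical domain with a single autocreate'd vGPU. */
struct mock_domain {
    const char *vgpu_type;
    int running;
};

/* Start: create the vGPU on demand, fail if the pGPU is busy with
 * another type. */
static int mock_domain_start(struct mock_pgpu *gpu, struct mock_domain *dom)
{
    assert(gpu != NULL && dom != NULL);
    if (dom->running)
        return 0;
    if (gpu->active_count > 0 &&
        strcmp(gpu->active_type, dom->vgpu_type) != 0)
        return -1;
    gpu->active_type = dom->vgpu_type;
    gpu->active_count++;
    dom->running = 1;
    return 0;
}

/* Shutdown: destroy the vGPU again, so only its inactive definition
 * remains and the pGPU is released once the last vGPU is gone. */
static void mock_domain_shutdown(struct mock_pgpu *gpu,
                                 struct mock_domain *dom)
{
    assert(gpu != NULL && dom != NULL);
    if (!dom->running)
        return;
    dom->running = 0;
    if (--gpu->active_count == 0)
        gpu->active_type = NULL;
}
```

With this model, a domain using a different vGPU type is refused while the first domain runs, but starts fine once the first one has shut down, because nothing is left over from the stopped domain.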