[libvirt] RFC: Creating mediated devices with libvirt

Thu Jun 22 21:02:15 UTC 2017

On 06/22/2017 11:52 AM, Pavel Hrdina wrote:
> On Thu, Jun 22, 2017 at 09:28:57AM -0600, Alex Williamson wrote:
>> On Thu, 22 Jun 2017 17:14:48 +0200
>> Erik Skultety <eskultet at redhat.com> wrote:
>>
>>> [...]
>>>>>
>>>>> ^this is the thing we constantly keep discussing as everyone has a slightly
>>>>> different angle of view - libvirt does not implement any kind of policy,
>>>>> therefore the only "configuration" would be the PCI parent placement - you say
>>>>> what to do and we do it, no logic in it, that's it. Now, I don't understand why
>>>>> taking care of the guesswork for the user in the simplest manner possible counts
>>>>> as policy rather than as a mere convenience, be it just for developers and
>>>>> testers, but even that might apparently be perceived as a policy and therefore
>>>>> unacceptable.
>>>>>
>>>>> I still stand by the idea of having auto-creation as, unfortunately, I still
>>>>> fail to understand what the negative implications of having it are. Is it that
>>>>> it would become unnecessarily complex to maintain and we would regret it, that
>>>>> we'd get a huge amount of follow-up requests for extending the feature, or is
>>>>> it simply the interpretation that auto-create == policy?
>>>>
>>>> The increasing complexity of the qemu driver is a significant concern with
>>>> adding policy based logic to the code. Thinking about this though, if we
>>>> provide the inactive node device feature, then we can avoid essentially
>>>> all new code and complexity in the QEMU driver, and still support auto-create.
>>>>
>>>> ie, in the domain XML we just continue to have the exact same XML that
>>>> we already have today for mdevs, but with a single new attribute
>>>> autocreate=yes|no
>>>>
>>>>   <devices>
>>>>     <hostdev mode='subsystem' type='mdev' model='vfio-pci' autocreate='yes'>
>>>>     <source>
>>>>       <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
>>>
>>> So, just for clarification of the concept, the device with ^this UUID will have
>>> had to be defined by the nodedev API by the time we start to edit the domain
>>> XML in this manner, in which case the only thing autocreate=yes would do is
>>> actually create the mdev according to the nodedev config, right? Continuing
>>> with that thought, if the UUID doesn't refer to any of the inactive configs it
>>> will be an error, I suppose? What about the fact that only one vgpu type can
>>> live on the GPU? Even if you can successfully identify a device using the UUID
>>> in this way, you'll still face the problem that other types might currently be
>>> occupying the GPU and need to be torn down first. Will this be automated as
>>> well in what you suggest? I assume not.
>>>
>>>>     </source>
>>>>     </hostdev>
>>>>   </devices>
>>>>
>>>> In the QEMU driver, then the only change required is
>>>>
>>>>    if (def->autocreate)
>>>>        virNodeDeviceCreate(dev);
>>>
>>> Aha, so if a device gets torn down on shutdown, we won't face the problem with
>>> some other devices being active, all of them will have to be in the inactive
>>> state because they got torn down during the last shutdown - that would work.
>>
>>
>> I'm not familiar with how inactive devices would be defined in the
>> nodedev API, would someone mind explaining or providing an example
>> please?  I don't understand where the metadata is stored that describes
>> the what and where of a given UUID.  Thanks,
> 
> It would basically copy what we do for domains.  Currently there is
> virNodeDeviceCreateXML(), which takes an XML definition and creates a
> new active node device, and virNodeDeviceDestroy(), which takes an
> existing active node device object as its argument.

FWIW: (Just in case someone doesn't know yet...) The only current
CreateXML consumer is for NPIV/vHBA devices. As I've pointed out before
I see a lot of similarities w/ mdev because they both have a dependency
on "something else" in order for proper creation. NPIV/vHBA requires an
HBA (scsi_hostN) that has a sysfs structure with a vport_create function
to create the vHBA. The HBA scsi_hostN is instantiated during
udevEnumerateDevices processing while the vHBA scsi_hostM is created
during udevEventHandleCallback.

The CreateXML provides an essentially 'transient' model to describe
a(the) vHBA device(s). After host reboot, one would have to run virsh
nodedev-create file.xml in order to recreate their vHBA.

In order to create more permanent vHBAs, it's possible to define a
storage pool that creates the vHBA when the storage pool is started.
So while there's no DefineXML support, there is a model that does
provide a mechanism to have persistence without needing to have a
DefineXML for node devices.
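
For anyone who has not seen the transient vHBA flow, here is a minimal sketch of building the XML one would hand to "virsh nodedev-create file.xml". It only generates the document and does not talk to libvirt. The helper name and the parent "scsi_host5" are made up for illustration, and the element layout is the vHBA form as I recall it, so treat it as an assumption and check it against the node device XML format documentation.

```python
import xml.etree.ElementTree as ET


def vhba_nodedev_xml(parent: str) -> str:
    """Return node device XML asking for a vHBA on the HBA named `parent`.

    Hypothetical helper: it only builds the document that would be saved
    to file.xml, it does not create anything on the host.
    """
    device = ET.Element("device")
    ET.SubElement(device, "parent").text = parent
    # A vHBA is described as a scsi_host with a nested fc_host capability.
    scsi_host = ET.SubElement(device, "capability", {"type": "scsi_host"})
    ET.SubElement(scsi_host, "capability", {"type": "fc_host"})
    return ET.tostring(device, encoding="unicode")


# "scsi_host5" is an invented parent name, used only for illustration.
print(vhba_nodedev_xml("scsi_host5"))
```

Because nothing persists this XML on the libvirt side, the same file has to be fed back in after every host reboot, which is the "transient" behaviour described above.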

> 
> We would extend the functionality with new APIs:
> 
>   - virNodeDeviceCreate() which would take as argument an existing
>     inactive node device object.
> 
>   - virNodeDeviceDefineXML() would define the node device as inactive.
> 
> With the virNodeDeviceDefineXML() you would create a list of predefined
> inactive devices which could be obtained by
> virConnectListAllNodeDevices() for example.
> 
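
To make the proposed lifecycle concrete, here is a toy in-memory model of it. This is not libvirt code: the class and method names are hypothetical stand-ins for the existing virNodeDeviceCreateXML()/virNodeDeviceDestroy() and the proposed virNodeDeviceDefineXML()/virNodeDeviceCreate(), and the real APIs would of course touch sysfs and files on disk.

```python
class NodeDeviceRegistry:
    """Toy model of defined (persistent) versus active node devices."""

    def __init__(self):
        self._defined = {}  # name -> XML of persistent definitions
        self._active = {}   # name -> XML of currently active devices

    def define_xml(self, name, xml):
        # Stand-in for the proposed virNodeDeviceDefineXML():
        # store the config, leave the device inactive.
        self._defined[name] = xml

    def create(self, name):
        # Stand-in for the proposed virNodeDeviceCreate():
        # start a device from its stored definition.
        if name not in self._defined:
            raise KeyError(f"no inactive definition for {name!r}")
        self._active[name] = self._defined[name]

    def create_xml(self, name, xml):
        # Stand-in for the existing virNodeDeviceCreateXML():
        # a transient device with no stored definition.
        self._active[name] = xml

    def destroy(self, name):
        # Stand-in for virNodeDeviceDestroy(): the device stops,
        # but any stored definition survives.
        del self._active[name]

    def list_all(self):
        # Stand-in for virConnectListAllNodeDevices(): name -> state.
        names = sorted(set(self._defined) | set(self._active))
        return {n: "active" if n in self._active else "inactive" for n in names}


reg = NodeDeviceRegistry()
reg.define_xml("mdev_a", "<device/>")   # persistent
reg.create("mdev_a")
reg.create_xml("vhba_b", "<device/>")   # transient
reg.destroy("mdev_a")
reg.destroy("vhba_b")
print(reg.list_all())  # {'mdev_a': 'inactive'}
```

The defined device is still listed as inactive after destroy, while the transient one is gone, which is the same split domains have today.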

Given various experiences with HBA/vHBA, I wonder if we should just let
udev (and its predecessor HAL) be the only thing that "defines" what a
node device is (keeping vHBA for historical purposes).

Of perhaps related concern/interest - there was a recent series on list
related to mdev and some underlying udev/systemd/kernel issue that
results in "inconsistent" failures. The proposed fix involved wait
loops. I pointed out to Erik that a prior concern over any wait loop I
would add for problems with vHBA initialization was that it could
cause unnecessary waits during libvirtd startup processing.

Additionally, if we added processing to node device startup that reads
the define'd XML files, would that then run into trouble and cause
startup failures? Do we ignore failures? Do we continue to add wait
threads to get specific data that wasn't present at some point in time
but will be soon? The node device initialization is fairly early on
(network, interface, storage, node device, ...).
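
To illustrate the trade-off, a wait loop of the kind being discussed might look roughly like the sketch below. It is not taken from the series in question; the function name, the defaults, and the use of /sys/bus/mdev/devices as the path to poll are all assumptions for the sake of the example.

```python
import os
import time


def wait_for_path(path, timeout=1.0, interval=0.05):
    """Poll until `path` exists or `timeout` seconds pass.

    Returns True if the path appeared, False on timeout. Hypothetical
    helper, shown only to make the startup-delay concern concrete.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# On a host without mdev support this blocks for the full timeout,
# which is the kind of delay that would add up during daemon startup.
found = wait_for_path("/sys/bus/mdev/devices", timeout=0.2)
print("mdev bus present:", found)
```

Every device that is slow or absent costs up to the full timeout, and the caller still has to decide what a False result means: ignore it, or fail startup.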

John

And as I've seen written by Erik before - I'll reply to the top level
with another idea rather than just looking like a long complaint ;-).

> Internally we would store XML files the same way as we do for domains,
> somewhere in "/etc/libvirt/..." and like with domains the APIs would
> work with these files.
> 
> In virsh terms there would be a similar analogy to the domain commands:
> 
> "virsh nodedev-start" could simply map to virNodeDeviceCreate() and
> would work like "virsh start" for domains and "virsh nodedev-define"
> would map to virNodeDeviceDefineXML() and work the same way as
> "virsh define".  You could simply list the predefined mdev devices
> using "virsh nodedev-list", get the UUID of an existing mdev device and
> use it in a domain.
> 
> In virt-manager there could be a new type of hostdev device where you
> could select one of the existing mdev devices from a drop-down list where
> virt-manager would show nice user-friendly descriptions of the mdev
> devices but under the hood it would put the UUID in the domain XML.
> 
> Pavel
> 
>>
>> Alex
>>
>> --
>> libvir-list mailing list
>> libvir-list at redhat.com
>> https://www.redhat.com/mailman/listinfo/libvir-list
>>