[libvirt] RFC: Creating mediated devices with libvirt

Thu Jun 22 16:33:16 UTC 2017

On 06/22/2017 11:28 AM, Alex Williamson wrote:
> On Thu, 22 Jun 2017 17:14:48 +0200
> Erik Skultety <eskultet at redhat.com> wrote:
> 
>> [...]
>>>>
>>>> ^this is the thing we constantly keep discussing as everyone has a slightly
>>>> different angle of view - libvirt does not implement any kind of policy,
>>>> therefore the only "configuration" would be the PCI parent placement - you say
>>>> what to do and we do it, no logic in it, that's it. Now, I don't understand
>>>> taking care of the guesswork for the user in the simplest manner possible as
>>>> policy rather than as a mere convenience, be it just for developers and testers, but
>>>> even that might apparently be perceived as a policy and therefore unacceptable.
>>>>
>>>> I still stand by the idea of having auto-creation as unfortunately, I sort of still
>>>> fail to understand what the negative implications of having it are - is that it
>>>> would get just unnecessarily too complex to maintain in the future that we would
>>>> regret it or that we'd get a huge amount of follow-up requests for extending the
>>>> feature or is it just that simply the interpretation of auto-create == policy?  
>>>
>>> The increasing complexity of the qemu driver is a significant concern with
>>> adding policy based logic to the code. Thinking about this though, if we
>>> provide the inactive node device feature, then we can avoid essentially
>>> all new code and complexity in the QEMU driver, and still support auto-create.
>>>
>>> ie, in the domain XML we just continue to have the exact same XML that
>>> we already have today for mdevs, but with a single new attribute
>>> autocreate=yes|no
>>>
>>>   <devices>
>>>     <hostdev mode='subsystem' type='mdev' model='vfio-pci' autocreate="yes">
>>>     <source>
>>>       <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
>>
>> So, just for clarification of the concept, the device with ^this UUID will have
>> had to be defined by the nodedev API by the time we start to edit the domain
>> XML in this manner in which case the only thing the autocreate=yes would do is
>> to actually create the mdev according to the nodedev config, right? Continuing
>> with that thought, if UUID doesn't refer to any of the inactive configs it will
>> be an error I suppose? What about the fact that only one vgpu type can live on
>> the GPU? Even if you can successfully identify a device using the UUID in this
>> way, you'll still face the problem that other types might be currently
>> occupying the GPU and need to be torn down first, will this be automated as
>> well in what you suggest? I assume not.
>>
>>>     </source>
>>>     </hostdev>
>>>   </devices>
>>>
>>> In the QEMU driver, then the only change required is
>>>
>>>    if (def->autocreate)
>>>        virNodeDeviceCreate(dev)  
>>
>> Aha, so if a device gets torn down on shutdown, we won't face the problem with
>> some other devices being active, all of them will have to be in the inactive
>> state because they got torn down during the last shutdown - that would work.
> 
> 
> I'm not familiar with how inactive devices would be defined in the
> nodedev API, would someone mind explaining or providing an example
> please?  I don't understand where the metadata is stored that describes
> the what and where of a given UUID.  Thanks,

You don't understand it because it doesn't exist yet :-)

The idea is essentially the same as what we've talked about, except that
all the information about parent PCI address, desired type of child, and
anything else (is there anything else?) is stored in some
not-yet-specified persistent node device config rather than directly in
the domain XML. Maybe something like:

  <nodedevice>
    <uuid>BobLobLaw</uuid>
    <parent>
      <address type='pci' .... />
    </parent>
    <child type='MoreBlah'/>
  </nodedevice>

I haven't thought about how it would show the difference between active
and inactive - didn't get enough coffee today and I have a headache.

The advantage of this is that it uncouples the specifics of the child
device from the domain XML - the only thing in the domain XML is the
uuid. So a device config with that uuid would need to exist on every
host where you wanted to run a particular guest, but the details could
be different, yet you wouldn't need to edit the domain XML. This is a
similar concept to the idea of creating libvirt networks that are just
an indirect pointer to a bridge device (which may have a different name
on each host) or to an SRIOV PF (yeah, I know Dan doesn't like that
feature, but I find it very useful, and unobtrusive if management
chooses not to use it).

So from your point of view (I'm talking to Alex here), implementing it
this way would mean that you would need to create the child device
definitions in the nodedev driver once (and possibly/hopefully the uuid
of the devices would be autogenerated, same as we do for uuids in other
parts of libvirt config), then copy that uuid to the domain config one
time. But after doing that once, you would be able to start and stop
domains and the host without any extra action. You could also define
different nodedevices that used the same parent for different child
types, and reference them from different domain definitions, as long as
you never tried to start more than one of them at a time (I'm thinking
about Nvidia mdevs here, where you can only have one child type active
on a particular parent at any time - if you did try to do this, libvirt
would of course log an error and refuse to start the domain).
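
To make that workflow concrete, here is a toy Python model of it. This is
only a sketch, not libvirt code: NodeDevConfig, Host, define(), create() and
destroy() are hypothetical names invented for illustration, the uuids and
PCI address are placeholders, and the "only one child type active per
parent" rule (which above is stated only for Nvidia mdevs) is applied to
every parent to keep the model small.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeDevConfig:
    """Hypothetical persistent (inactive) definition of a child device."""
    uuid: str
    parent: str      # placeholder for the parent PCI address
    child_type: str  # placeholder for the desired child type


@dataclass
class Host:
    configs: dict = field(default_factory=dict)  # uuid -> defined config
    active: dict = field(default_factory=dict)   # uuid -> active config

    def define(self, cfg):
        """Store a persistent config; nothing is created yet."""
        self.configs[cfg.uuid] = cfg

    def create(self, uuid):
        """Activate a defined device, as would happen at domain start."""
        if uuid not in self.configs:
            raise LookupError(f"no node device config with uuid {uuid}")
        if uuid in self.active:
            raise RuntimeError(f"device {uuid} is already active")
        cfg = self.configs[uuid]
        # Assumed rule: a parent can host only one child type at a time.
        for other in self.active.values():
            if other.parent == cfg.parent and other.child_type != cfg.child_type:
                raise RuntimeError(
                    f"parent {cfg.parent} already has active type {other.child_type}"
                )
        self.active[uuid] = cfg

    def destroy(self, uuid):
        """Tear the device down, as would happen at domain shutdown."""
        self.active.pop(uuid, None)


host = Host()
host.define(NodeDevConfig("uuid-a", "0000:01:00.0", "type-1"))
host.define(NodeDevConfig("uuid-b", "0000:01:00.0", "type-2"))

host.create("uuid-a")        # domain A starts
try:
    host.create("uuid-b")    # domain B wants another type on the same parent
    conflict = False
except RuntimeError:
    conflict = True          # refused while domain A is running

host.destroy("uuid-a")       # domain A shuts down
host.create("uuid-b")        # now domain B can start
```

The domain side only ever holds the uuid; everything about the parent and
the child type lives in the per-host definition, so the same domain config
could be used on a host where the definition behind that uuid is different.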

I like this idea. I think it gives both you and me what we want for
small/dev/testing purposes, and may also be of use to larger management
applications, but it won't get in anyone's way if they don't
need/want/like it.

The only downsides are:

1) It will take more effort to implement, since the nodedev driver
doesn't yet understand the concept of persistent config. (But doing it
is a *very good* thing, so it's worthwhile.)

2) It makes it pointless for me to finally hit send on the response to
this thread that I started typing all the way last Saturday, but haven't
sent because, as usual, I changed my mind 4 or 5 times in the interim
based on various discussions and "shower thoughts" :-P

... okay, another "shower thought" is coming in... One deficiency of
this comes to mind - since the domain config references the device by
uuid, and an existing child device's uuid can't be changed, the unique
uuid used by a particular domain must be defined on all of the hosts
that the domain might be moved to. And since other domains can't share
that uuid (unless you're 100% sure they'll never be active at the same
time), you won't be able to implement the alternate idea of "pre-create
all the devices, then assign them to domains as needed"; instead, you'll
be forced to use the "create-on-demand" model.

For pre-created devices to work, you really need an extra layer of
indirection - a named pool of devices, and domain config that references
the pool name rather than the uuid of a specific device. Maybe this can
be a later addition (or alternatively we could require management to
modify the domain config each time the domain is started, and to keep
track themselves of which devices are currently in use; that seems a bit
haphazard, especially if you consider the possibility of multiple
management applications on one host).
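
As a rough illustration of that extra layer of indirection, here is a
minimal Python sketch of a named pool. None of this exists in libvirt:
MdevPool, acquire() and release() are hypothetical names, and the pool
name and uuids are placeholders. The point is only that the domain config
would name the pool, and a single owner of the pool would track which
pre-created device is in use by which domain.

```python
class MdevPool:
    """Hypothetical named pool of pre-created child device uuids."""

    def __init__(self, name, uuids):
        self.name = name
        self._free = list(uuids)  # pre-created devices not assigned yet
        self._in_use = {}         # domain name -> uuid currently assigned

    def acquire(self, domain):
        """Hand a free device to a starting domain."""
        if domain in self._in_use:
            raise RuntimeError(f"{domain} already holds a device from {self.name}")
        if not self._free:
            raise RuntimeError(f"pool {self.name} has no free devices")
        uuid = self._free.pop(0)
        self._in_use[domain] = uuid
        return uuid

    def release(self, domain):
        """Return the domain's device to the pool at shutdown."""
        uuid = self._in_use.pop(domain)
        self._free.append(uuid)


pool = MdevPool("example-pool", ["uuid-1", "uuid-2"])

first = pool.acquire("guest-a")
second = pool.acquire("guest-b")
try:
    pool.acquire("guest-c")   # nothing left in the pool
    exhausted = False
except RuntimeError:
    exhausted = True

pool.release("guest-a")
third = pool.acquire("guest-c")  # reuses the device guest-a gave back
```

With the bookkeeping in one place, two management applications on the same
host would not each need their own idea of which devices are free.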