[libvirt] RFC: Creating mediated devices with libvirt

Martin Polednik mpolednik at redhat.com
Thu Jun 22 12:55:13 UTC 2017


On 22/06/17 14:05 +0200, Erik Skultety wrote:
>On Thu, Jun 22, 2017 at 10:41:13AM +0200, Martin Polednik wrote:
>> On 16/06/17 18:14 +0100, Daniel P. Berrange wrote:
>> > On Fri, Jun 16, 2017 at 06:11:17PM +0100, Daniel P. Berrange wrote:
>> > > On Fri, Jun 16, 2017 at 11:02:55AM -0600, Alex Williamson wrote:
>> > > > On Fri, 16 Jun 2017 11:32:04 -0400
>> > > > Laine Stump <laine at redhat.com> wrote:
>> > > >
>> > > > > On 06/15/2017 02:42 PM, Alex Williamson wrote:
>> > > > > > On Thu, 15 Jun 2017 09:33:01 +0100
>> > > > > > "Daniel P. Berrange" <berrange at redhat.com> wrote:
>> > > > > >
>> > > > > >> On Thu, Jun 15, 2017 at 12:06:43AM +0200, Erik Skultety wrote:
>> > > > > >>> Hi all,
>> > > > > >>>
>> > > > > >>> so there's been an off-list discussion about finally implementing creation of
>> > > > > >>> mediated devices with libvirt and it's more than desired to get as many opinions
>> > > > > >>> on that as possible, so please do share your ideas. This did come up already as
>> > > > > >>> part of some older threads ([1] for example), so this will be a respin of the
>> > > > > >>> discussions. Long story short, we decided to put device creation off and focus
>> > > > > >>> on the introduction of the framework as such first and build upon that later,
>> > > > > >>> i.e. now.
>> > > > > >>>
>> > > > > >>> [1] https://www.redhat.com/archives/libvir-list/2017-February/msg00177.html
>> > > > > >>>
>> > > > > >>> ========================================
>> > > > > >>> PART 1: NODEDEV-DRIVER
>> > > > > >>> ========================================
>> > > > > >>>
>> > > > > >>> API-wise, device creation through the nodedev driver should be pretty
>> > > > > >>> straightforward and without any issues, since virNodeDevCreateXML takes an XML
>> > > > > >>> and does support flags. Looking at the current device XML:
>> > > > > >>>
>> > > > > >>> <device>
>> > > > > >>>   <name>mdev_0cce8709_0640_46ef_bd14_962c7f73cc6f</name>
>> > > > > >>>   <path>/sys/devices/pci0000:00/.../0cce8709-0640-46ef-bd14-962c7f73cc6f</path>
>> > > > > >>>   <parent>pci_0000_03_00_0</parent>
>> > > > > >>>   <driver>
>> > > > > >>>     <name>vfio_mdev</name>
>> > > > > >>>   </driver>
>> > > > > >>>   <capability type='mdev'>
>> > > > > >>>     <type id='nvidia-11'/>
>> > > > > >>>     <iommuGroup number='13'/>
>> > > > > >>>     <uuid>UUID<uuid> <!-- optional enhancement, see below -->
>> > > > > >>>   </capability>
>> > > > > >>> </device>
>> > > > > >>>
>> > > > > >>> We can ignore <path>,<driver>,<iommugroup> elements, since these are useless
>> > > > > >>> during creation. We also cannot use <name> since we don't support arbitrary
>> > > > > >>> names and we also can't rely on users providing a name in correct form which we
>> > > > > >>> would need to further parse in order to get the UUID.
>> > > > > >>> So since the only thing missing to successfully use create an mdev using XML is
>> > > > > >>> the UUID (if user doesn't want it to be generated automatically), how about
>> > > > > >>> having a <uuid> subelement under <capability> just like PCIs have <domain> and
>> > > > > >>> friends, USBs have <bus> & <device>, interfaces have <address> to uniquely
>> > > > > >>> identify the device even if the name itself is unique.
>> > > > > >>> Removal of a device should work as well, although we might want to
>> > > > > >>> consider creating a *Flags version of the API.
>> > > > > >>>
>> > > > > >>> =============================================================
>> > > > > >>> PART 2: DOMAIN XML & DEVICE AUTO-CREATION, NO POLICY INVOLVED!
>> > > > > >>> =============================================================
>> > > > > >>>
>> > > > > >>> There were some doubts about auto-creation mentioned in [1], although they
>> > > > > >>> weren't specified further. So hopefully, we'll get further in the discussion
>> > > > > >>> this time.
>> > > > > >>>
>> > > > > >>> From my perspective there are two main reasons/benefits to that:
>> > > > > >>>
>> > > > > >>> 1) Convenience
>> > > > > >>> For apps like virt-manager, user will want to add a host device transparently,
>> > > > > >>> "hey libvirt, I want an mdev assigned to my VM, can you do that". Even for
>> > > > > >>> higher management apps, like oVirt, even they might not care about the parent
>> > > > > >>> device at all times and considering that they would need to enumerate the
>> > > > > >>> parents, pick one, create the device XML and pass it to the nodedev driver, IMHO
>> > > > > >>> it would actually	 be easier and faster to just do it directly through sysfs,
>> > > > > >>> bypassing libvirt once again....
>> > > > > >>
>> > > > > >> The convenience only works if the policy we've provided in libvirt actually
>> > > > > >> matches the policy the application wants. I think it is quite likely that with
>> > > > > >> cloud the mdevs will be created out of band from the domain startup process.
>> > > > > >> It is possible the app will just have a fixed set of mdevs pre-created when
>> > > > > >> the host starts up. Or that the mgmt app wants the domain startup process to
>> > > > > >> be a two phase setup, where it first allocates the resources needed, and later
>> > > > > >> then tries to start the guest. This is why I keep saying that putting this kind
>> > > > > >> of "convenient" policy in libvirt is a bad idea - it is essentially just putting
>> > > > > >> a bit of virt-manager code into libvirt - more advanced apps will need more
>> > > > > >> flexibility in this area.
>> > > > > >>
>> > > > > >>> 2) Future domain migration
>> > > > > >>> Suppose now that the mdev backing physical devices support state dump and
>> > > > > >>> reload. Chances are, that the corresponding mdev doesn't even exist or has a
>> > > > > >>> different UUID on the destination, so libvirt would do its best to handle this
>> > > > > >>> before the domain could be resumed.
>> > > > > >>
>> > > > > >> This is not an unusual scenario - there are already many other parts of the
>> > > > > >> device backend config that need to change prior to migration, especially for
>> > > > > >> anything related to host devices, so apps already have support for doing
>> > > > > >> this, which is more flexible & convenient becasue it doesn't tie creation of
>> > > > > >> the mdevs to running of the migrate command.
>> > > > > >>
>> > > > > >> IOW, I'm still against adding any kind of automatic creation policy for
>> > > > > >> mdevs in libvirt. Just provide the node device API support.
>> > > > > >
>> > > > > > I'm not super clear on the extent of what you're against here, is it
>> > > > > > all forms of device creation or only a placement policy?  Are you
>> > > > > > against any form of having the XML specify the non-instantiated mdev
>> > > > > > that it wants?  We've clearly made an important step with libvirt
>> > > > > > supporting pre-created mdevs, but as a user of that support I find it
>> > > > > > incredibly tedious.  I typically do a dumpxml, copy out the UUID,
>> > > > > > wonder what type of device it might have been last time, create it,
>> > > > > > start the domain and cross my fingers. Pre-creating mdev devices is not
>> > > > > > really practical, I might have use cases where I want multiple low-end
>> > > > > > mdev devices and another where I have a single high-end device.  Those
>> > > > > > cannot exist at the same time. Requiring extensive higher level
>> > > > > > management tools is not really an option either, I'm not going to
>> > > > > > install oVirt on my desktop/laptop just so I can launch a GVT-g VM once
>> > > > > > in a while (no offense).  So I really hope that libvirt itself can
>> > > > > > provide some degree of mdev creation.
>> > > > >
>> > > > >
>> > > > > Maybe there can be something in between the "all child devices must be
>> > > > > pre-created" and "a child device will be automatically created on an
>> > > > > automatically chosen parent device as needed". In particular, we could
>> > > > > forego the "automatically chosen parent device" part of that. The guest
>> > > > > configuration could simply contain the PCI address of the parent and the
>> > > > > desired type of the child. If we did this there wouldn't be any policy
>> > > > > decision to make - all the variables are determined - but it would make
>> > > > > life easier for people running small hosts (i.e. no oVirt/Openstack, a
>> > > > > single mdev parent device). Openstack and oVirt (and whoever) would of
>> > > > > course be free to ignore this and pre-create pools of devices themselves
>> > > > > in the name of more precise control and better predictability (just as,
>> > > > > for example, OpenStack ignores libvirt's "pools of hostdev network
>> > > > > devices" and instead manages the pool of devices itself and uses
>> > > > > <interface type='hostdev'> directly).
>> > > >
>> > > > This seems not that substantially different from managed='yes' on a
>> > > > vfio hostdev to me.  It makes the device available to the VM before it
>> > > > starts and returns it after.  In one case that's switching the binding
>> > > > on an existing device, in another it's creating and removing.  Once
>> > > > again, I can't tell from Dan's response if he's opposed to this entire
>> > > > idea or just the aspects where libvirt needs to impose a policy
>> > > > decision.  For me personally, the functionality difference is quite
>> > > > substantial.
>> > >
>> > > I'm fine with libvirt having APIs in the node device APIs to enable
>> > > create/delete with libvirt, as well as using managed=yes in the same
>> > > manner that we do for regular PCI devices (the bind/unbind to vfio
>> > > or pci-back)
>> >
>> > Oh, and we really need to fix the big missing feature in the node
>> > device APIs of persistent, inactive configs. eg we should be able
>> > to record XML configs of mdevs (and npiv devices too), in /etc/libvirt
>> > so they persist across reboots, and can be setup for auto-start on
>> > boot too.
>>
>> That doesn't help mdev in any way though. It doesn't make sense to
>> generate new UUID for given VM at each start. So in case of
>
>What statement does this^^ refer to? Why would you generate a new UUID for a VM
>at each start, you'd generate it only once and then store it, the same way as
>domain UUIDs work.
>
>> single host, the persistent file is redundant to the domain XML (as
>> long as uuid+parent is in the xml) and in case of cluster we'd have to
>
>Right now you don't have any info about the parent device in the domain XML and
>such data would only exist in the XML if we all agreed on auto-creating mdevs,
>in which case persistent configs in nodedev would be unnecessary and vice-versa.
>
>> copy all possible VM mdev definitions to all the hosts.
>
>^For mdev configs, you might be better off with creating them explicitly than
>copying configs, simply because given the information the XML has, you might
>conflict with UUIDs between hosts, so you'd have to take care for that. Parents
>have different PCI addresses that most probably wouldn't match across hosts, so
>from automation point of view, I think writing a stub recreating the whole set
>of devices/configs might actually be easier than copying & handling them
>(solely because the 2 things left - after the ones I mentioned - in the XML are
>the vgpu type and IOMMU group number which AFAIK cannot be requested explicitly).
>
>>
>> The idea works nicely if you had such definitions accessible in the
>> cluster and could define a group of devices (gpu+soundcard, single
>> mdev, single vf, ...) that would later be assigned to a VM (let's hope
>> kubevirt can get there).
>>
>> As for automatic creation, I think it's on the "nice to have" level.
>> So far libvirt is close to useless when working with mdevs as all the
>> data is in the same sysfs place where create/delete endpoints are - as
>> mentioned earlier, we can just get the data and do everything directly
>> from there instead of dealing with XML and bunch of new API calls.
>> Having at least some *configurable* auto create policy might add some
>
>^this is the thing we constantly keep discussing as everyone has a slightly
>different angle of view - libvirt does not implement any kind of policy,
>therefore the only "configuration" would be the PCI parent placement - you say
>what to do and we do it, no logic in it, that's it. Now, I don't understand
>taking care of the guesswork for the user in the simplest manner possible as
>policy rather as a mere convenience, be it just for developers and testers, but
>even that might apparently be perceived as a policy and therefore unacceptable.
>
>I still stand by idea of having auto-creation as unfortunately, I sort of still
>fail to understand what the negative implications of having it are - is that it
>would get just unnecessarily too complex to maintain in the future that we would
>regret it or that we'd get a huge amount of follow-up requests for extending the
>feature or is it just that simply the interpretation of auto-create == policy?

Optional creation is fine. It could also be helpful in future to carry
over some device information that we would otherwise have to push into
metadata. Possible device XML idea:

  <devices>
    <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
    <source>
      <address uuid='$uuid'>
      <autocreate/>
      <address type='pci' ...> <!-- parent? -->
      <mdev type=''> <!-- mdev_type? -->
    </source>
    </hostdev>
  </devices>

This change gives us a) convenient auto-creation b) data that will be
needed for cloud use cases anyway. Such solution doesn't enforce any
policy on management software but would definitely be a strong point
for libvirt usage (instead of sysfs access).

>Thanks,
>Erik




More information about the libvir-list mailing list