[libvirt] [RFC PATCH] hostdev: add support for "managed='detach'"

Alex Williamson alex.williamson at redhat.com
Tue Mar 22 15:04:29 UTC 2016


On Tue, 22 Mar 2016 12:54:12 +0100
Andrea Bolognani <abologna at redhat.com> wrote:

> On Fri, 2016-03-18 at 11:03 -0600, Alex Williamson wrote:
> > > Anyway, after reading your explanation I'm wondering if we
> > > shouldn't always recommend a setup where devices that are going
> > > to be assigned to guests are just never bound to any host driver,
> > > as that sounds like it would have the most chances of working
> > > reliably.
> > > 
> > > IIUC, the pci-stubs.ids kernel parameter you mentioned above does
> > > exactly that. Maybe blacklisting the host driver as well might be
> > > a good idea? Anything else a user would need to do? Would the
> > > user or management layer not be able to configure something like
> > > that in a oVirt / OpenStack environment? What should we change
> > > to make it possible or easier?  
> > Both pci-stub.ids and blacklisting have some serious problems.  A
> > common thread on the vfio-users list is that someone has two identical
> > devices and wants to use one for the host and one for a VM and can't
> > figure out how to make that work.  
> 
> Right, because pci-stub.ids matches on vendor/device IDs and not
> on 'domain:bus:slot.function' addresses, so if you have two or more
> identical devices it's all or nothing.
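
(For reference, pci-stub.ids takes hex vendor:device pairs on the kernel
command line, so something like

  pci-stub.ids=8086:10fb,10de:13c2

grabs every device with either ID, which is exactly why it's all or
nothing for identical cards.)
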
> 
> > The only solution I know is to
> > escalate to initramfs scripts that are smarter about which device to
> > pick.  Maybe this implies that we need some initramfs integration for
> > driver_override so that we can force a particular driver for a
> > particular device, but that fails to acknowledge that a user may want
> > the device to use the default driver, up until the point that they
> > don't.  
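
(To make the driver_override mechanics concrete, per device it is just a
couple of sysfs writes, e.g. for a hypothetical 0000:02:00.0:

  echo vfio-pci > /sys/bus/pci/devices/0000:02:00.0/driver_override
  echo 0000:02:00.0 > /sys/bus/pci/devices/0000:02:00.0/driver/unbind   # only if already bound
  echo 0000:02:00.0 > /sys/bus/pci/drivers_probe

After that only a driver named vfio-pci can bind to that one device;
writing an empty string back to driver_override lifts the restriction.)
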
> 
> Is that really something we want to encourage? See below.
> 
> > Blacklisting has similar issues, but worse.  Not all SR-IOV drivers are
> > partitioned such that there's a separate driver for PF vs VF, quite a
> > few use the same driver for both.  So immediately blacklisting doesn't
> > work for a large class of drivers.  In general it's even more broad
> > than pci-stub.ids, for instance I might not want snd-hda-intel to bind
> > to the GPU audio device used for a VM, but I definitely want
> > snd-hda-intel binding to my primary audio device.  I've also found that
> > marking i915 for blacklist does absolutely nothing... go figure.  
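
(Blacklisting, for comparison, is just a modprobe.d entry like

  # /etc/modprobe.d/local.conf
  blacklist snd-hda-intel

and it only blocks modalias-based autoloading; anything that loads the
module explicitly, or a driver built into the kernel, sails right past
it, which may be part of why blacklisting i915 appears to do nothing.)
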
> 
> Of course driver blacklisting is even more coarse-grained
> than pci-stub.ids, so not suitable in a lot of situations.
> Got it.
> 
> > BTW, even if we do either of these, do we still need managed='yes'
> > since the device will be bound to either pci-stub or nothing and needs
> > to get to vfio-pci?  I don't recall if libvirt accepts those cases or
> > not.  If not, then we're unnecessarily bouncing back and forth between
> > drivers even if we take a pci-stub/blacklist approach unless we inject
> > another layer that actually binds them to vfio-pci.  
> 
> Well, if a device is bound to pci-stub and we want to use
> VFIO passthrough we're always going to have to unbind it
> from pci-stub and bind it to vfio-pci, aren't we?
> 
> Shouldn't that be safe anyway? As long as the device is
> never bound to a host driver...
> 
> > Let's take it one step further, what if we made an initramfs script
> > that would set "vfio-pci" for the driver_override for a user specified
> > list of devices.  Recall that driver_override means that only the
> > driver with the matching name can bind to the device.  So now we boot
> > up with the user defined set of devices without drivers.  What happens
> > in either the managed='yes' or 'no' scenarios?  Logically this seems
> > like exactly when we want to use 'detach'.  
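
A rough sketch of what such an initramfs hook could boil down to (the
device list and where it comes from are made up for illustration):

  #!/bin/sh
  # devices the user wants reserved for vfio-pci; in practice the list
  # would come from the kernel command line or a config file baked into
  # the initramfs
  DEVICES="0000:02:00.0 0000:02:00.1"

  for dev in $DEVICES; do
      # nothing has bound yet this early in boot, so setting the
      # override is enough; whenever vfio-pci eventually loads it is
      # the only driver allowed to claim these devices
      echo vfio-pci > "/sys/bus/pci/devices/$dev/driver_override"
  done
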
> 
> If managed='no', libvirt will not attempt to rebind the
> device to the host driver after detaching it from the
> guest, so we're good.
> 
> If managed='yes', libvirt *will* attempt to rebind it
> to the host driver, but due to driver_override it will
> end up being bound to vfio-pci anyway: still safe.
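
(In domain XML terms that's just the existing attribute, e.g. with a
made-up device address:

  <hostdev mode='subsystem' type='pci' managed='no'>
    <source>
      <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </source>
  </hostdev>

and with driver_override in place the end result is the same whether
managed is 'yes' or 'no', as you say.)
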
> 
> This looks like the perfect solution for people who
> want to dedicate some host devices to VFIO passthrough
> exclusively... Of course it requires implementing the
> smart initramfs bits.
> 
> Could this be controlled by a kernel parameter? So you
> can just add something like
> 
>   vfio-pci.devices=0000:02:00.0,0000:03:00.0
> 
> to your bootloader configuration and be sure that the
> devices you've listed will never be touched by the host.
> 
> Note that I have no idea how much work implementing
> something like the above would be, or whether it would
> even be possible :)

vfio-pci is typically a module; device assignment and userspace drivers
are very niche as far as the kernel is concerned, and the driver is
complex enough that building it statically into the kernel has some
disadvantages.  So adding module options to vfio-pci doesn't really
help.  We sort of tried this by adding the vfio-pci.ids module
option so we'd have parity with pci-stub, but it didn't really work
because users need to go through enough hoops to get vfio-pci loaded
early that they might as well just use the driver_override option.
Live and learn; pci-stub.ids is still generally more useful than the
equivalent vfio-pci option because pci-stub is (or should be) built
statically into the kernel.
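
For completeness, the vfio-pci equivalent is a module option, e.g.

  # /etc/modprobe.d/vfio.conf
  options vfio-pci ids=8086:10fb,10de:13c2

but as noted it only helps if vfio-pci gets loaded before the device's
normal driver, which usually means forcing it into the initramfs anyway.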

Also, a fun property that we get to deal with in PCI space is that bus
numbers are under the control of the OS, so if you boot with
pci=assign-buses, your device address may change.  That means that
specifying a device by address is not as clean a solution as you
might think.  It seems like this quickly devolves into a question of
whether the kernel is even the right place to do this; if you have a
properly modular kernel, then userspace really has control over how
modules are loaded and can intervene to exclude certain devices.
The catch is that we need to do that early in boot, which circles right
back around to the initramfs scripts, but afaik those are generally
managed by each distro with only loose similarities, so we'd need to
hope that whatever we contribute becomes a de facto standard.
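
One possible building block, purely as a sketch and with an example
address, is a udev rule that sets the override as soon as the device
appears:

  # e.g. /etc/udev/rules.d/10-vfio.rules, also copied into the initramfs
  ACTION=="add", SUBSYSTEM=="pci", KERNEL=="0000:02:00.0", ATTR{driver_override}="vfio-pci"

That works for modular drivers, since the attribute gets set before the
module autoload fires, but it still does nothing for drivers built into
the kernel, so it's not a complete answer either.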
 
> > > That would give us a setup we can rely on, and cover all use
> > > cases you mentioned except that of someone assigning his laptop's
> > > GPU to a guest and needing the console to be available before the
> > > guest has started up because no other access to the machine is
> > > available. But in that case, even with managed='detach', the user
> > > would need to blindly restart the machine after guest shutdown,
> > > wouldn't he?  
> > I don't want to make this issue about any one particular driver because
> > I certainly hope that we'll eventually fix that one driver.  
> 
> Of course :)
> 
> > The
> > problem is more that this is not an uncommon scenario.  In one of my
> > previous replies I listed all of the driver issues that I know about.
> > In some cases we can't fix them because the driver is proprietary, in
> > others we just don't have the bandwidth.  Users have far more devices
> > at their disposal to test than we do, so they're likely going to
> > continue to run into these issues.  
> 
> Summing up the issues you listed in that message, along with
> my thoughts:
> 
>   * stealing audio device from host when running a game in a
>     guest, giving it back afterwards
> 
>     - desktop use case
>     - seems to be working fine with managed='yes'
>     - if something goes wrong when reattaching to the host,
>       the user can probably still go about his business;
>       rebooting the host is going to be annoying but not a
>       huge deal
>     - limited number of in-kernel slots means you can only
>       do this a number of times before it stops working;
>       again, having to reboot the host once every so many
>       guest boots is probably not a deal breaker
>     - managed='detach' is not helpful here, and neither is
>       the proposal above
> 
>   * NIC reserved for guest use
> 
>     - both server and desktop use case
>     - may or may not handle going back and forth between
>       the host and the guest
>     - managed='detach' would still prevent most issues,
>       assuming the host driver detaches cleanly
>     - best would be if the host never touched the device
>     - a solution like the one outlined above seems perfect
>     - for oVirt / OpenStack, the management layer could
>       take care of updating the bootloader configuration
>     - for pure libvirt, that responsibility would fall on
>       the user
>     - should virt-manager etc. rather default to
>       managed='detach' when assigning a NIC to a guest?
> 
>   * GPU with binary driver
> 
>     - both server and desktop use case
>     - doesn't handle dynamic unbind gracefully
>     - best would be if the host never touched the device
>     - a solution like the one outlined above seems perfect
>     - for oVirt / OpenStack, the management layer could
>       take care of updating the bootloader configuration
>     - for pure libvirt, that responsibility would fall on
>       the user
>     - managed='detach' can prevent some failures, but not
>       those that happen on host driver unbind
> 
>   * Primary Intel GPU
> 
>     - desktop use case (any reason to do this for servers?)
>     - will probably fail on reattach
>     - managed='detach' would prevent host crashes and the
>       like
>     - must assume the user has some other way to connect to
>       the host, because at some point they're going to be
>       unable to use the GPU from the host anyway
> 
> Have I missed anything?

I think the only thing missing is that libvirt sort of becomes a
natural management point for an assigned device because it's specified
with some management characteristics.  Users set managed='yes' both
because they assume libvirt will do the right thing and because the
managed='no' path requires them to come up with their own solution.  So
maybe we just need to create some framework that lets users take the
managed='no' path without feeling like they're stepping off a cliff into
their own ad hoc management scripts.  Thanks,

Alex



