[libvirt] [RFC PATCH] hostdev: add support for "managed='detach'"

Alex Williamson alex.williamson at redhat.com
Tue Mar 15 19:31:43 UTC 2016


On Tue, 15 Mar 2016 14:21:35 -0400
Laine Stump <laine at laine.org> wrote:

> On 03/15/2016 01:00 PM, Daniel P. Berrange wrote:
> > On Mon, Mar 14, 2016 at 03:41:48PM -0400, Laine Stump wrote:  
> >> Suggested by Alex Williamson.
> >>
> >> If you plan to assign a GPU to a virtual machine, but that GPU happens
> >> to be the host system console, you likely want it to start out using
> >> the host driver (so that boot messages/etc will be displayed), then
> >> later have the host driver replaced with vfio-pci for assignment to
> >> the virtual machine.
> >>
> >> However, in at least some cases (e.g. Intel i915) once the device has
> >> been detached from the host driver and attached to vfio-pci, attempts
> >> to reattach to the host driver only lead to "grief" (ask Alex for
> >> details). This means that simply using "managed='yes'" in libvirt
> >> won't work.
> >>
> >> And if you set "managed='no'" in libvirt then either you have to
> >> manually run virsh nodedev-detach prior to the first start of the
> >> guest, or you have to have a management application intelligent enough
> >> to know that it should detach from the host driver, but never reattach
> >> to it.
> >>
> >> This patch makes it simple/automatic to deal with such a case - it
> >> adds a third "managed" mode for assigned PCI devices, called
> >> "detach". It will detach ("unbind" in driver parlance) the device from
> >> the host driver prior to assigning it to the guest, but when the guest
> >> is finished with the device, will leave it bound to vfio-pci. This
> >> allows re-using the device for another guest, without requiring
> >> initial out-of-band intervention to unbind the host driver.  
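
For illustration, the new value would presumably sit in the same place
as the existing managed attribute on a PCI hostdev (the address below
is only an example, the typical IGD slot):

    <hostdev mode='subsystem' type='pci' managed='detach'>
      <source>
        <address domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      </source>
    </hostdev>
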
> > You say that managed=yes causes pain upon re-attachment and that
> > apps should use managed=detach to avoid it, but how do management
> > apps know which devices are going to cause pain ? Libvirt isn't
> > providing any info on whether a particular device id needs to
> > use managed=yes vs managed=detach, and we don't want to be asking
> > the user to choose between modes in openstack/ovirt IMHO. I think
> > that's a fundamental problem with inventing a new value for managed
> > here.  
> 
> My suspicion is that in many/most cases users don't actually need for 
> the device to be re-bound to the host driver after the guest is finished 
> with it, because they're only going to use the device to assign to a 
> different guest anyway. But because managed='yes' is what's supplied and 
> is the easiest way to get it set up for assignment to a guest, that's 
> what they use.
> 
> As a matter of fact, all this extra churn of changing the driver back 
> and forth for devices that are only actually used when they're bound to 
> vfio-pci just wastes time, and makes it more likely that libvirt and its 
> users will expose and get caught up in some strange 
> kernel driver loading/unloading bug (there was recently a bug reported 
> like this; unfortunately the BZ record had customer info in it, so it's 
> not publicly accessible :-( )
> 
> So beyond making this behavior available for the cases where it's 
> absolutely necessary, I think it is useful in other cases too, at the 
> user's discretion (and as I implied above, I think that if they 
> understood the function and the tradeoffs, most people would choose 
> managed='detach' rather than managed='yes').
> 
> (alternately, we could come back to the discussion of having persistent 
> nodedevice config, with one of the configurables being which devices 
> should be bound to vfio-pci when libvirtd is started. Did we maybe even 
> talk about exactly that in the past? I can't remember... That would of 
> course preclude the use case where someone 1) normally wanted to use the 
> device for the host, but 2) occasionally wanted to use it for a guest, 
> after which 3) they were well aware that they would need to reboot the 
> host before they could use the device on the host again. I know, I know 
> - "odd edge cases", and in particular "odd edge cases only encountered 
> by people who know other ways of working around the problem" :-))
> 
> 
> > Can you provide more details about the problems with detaching ?
> >
> > Is this inherent to all VGA cards, or is it specific to the Intel
> > i915, or specific to a kernel version or something else ?
> >
> > I feel like this is something where libvirt should "do the right
> > thing", since that's really what managed=yes is all about.
> >
> > e.g., if we have managed=yes and we see an i915, we should
> > automatically skip re-attach for that device.  
> 
> 
> Alex can give a much better description of that than I can (I had told 
> git to Cc him on the original patch, but it seems it didn't do that; I'm 
> trying again). But what if there is such a behavior now for a certain 
> set of VGA cards, and it gets fixed in the future? Would we continue to 
> force avoiding re-attach for the device? I understand the allure of 
> always doing the right thing without requiring config (and the dislike 
> of adding new seemingly esoteric options), but I don't know that libvirt 
> has (or can get) the necessary info to make the correct decision in all 
> cases.

I agree, blacklisting VGA devices or any other specific device types or
host drivers is bound to be the wrong thing to do for someone or at
some point in time.  I think if we look at the way devices are
typically used for device assignment, we'd probably see that they're
used exclusively for device assignment or exclusively for the host.  My
guess is that it's a much less common scenario that a user actually
wants to steal a device from the host only while a VM is using it.  It
is done, though; I know of folks who steal an audio device from the
host when they run their game VM and give it back when the VM is shut
down.  I don't know that it's possible for libvirt to always just do the
right thing here; it involves inferring the intentions of the user.

So here are the types of things we're dealing with that made me suggest
this idea to Laine; in the i915 scenario, the Intel graphics device
(IGD) is typically the primary host graphics.  If we want to assign it
to a VM, obviously at some point it needs to move to vfio-pci, but do
we know that the user has an alternate console configured or do they go
headless when that happens?  If they go headless then they probably
don't want to use kernel boot options and blacklisting to prevent i915
from claiming the device or getting it attached to pci-stub or
vfio-pci.  Often that's not even enough since efifb or vesafb might try
to claim resources of the device even if the PCI driver is prevented
from doing so.  In such a case, it's pretty convenient that the user
can just set managed='yes' and the device gets snatched away from the
host when the VM starts... but then the i915 driver sometimes barfs
when the VM is shut down and i915 takes back the device.  The host is
left in a mostly unusable state.  Yes, the user could do a
nodedev-detach before starting the VM and yes, the i915 driver issue
may just be temporary, but this isn't the first time this has
happened.
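
Roughly what those alternatives look like, for reference (the exact
options vary by setup, and 0000:00:02.0 is just the usual IGD address):

    # keep host drivers off the IGD from boot (kernel command line);
    # only makes sense if another console is available or headless is ok
    # (efifb/vesafb may need similar treatment)
    modprobe.blacklist=i915 video=efifb:off

    # or a one-time manual detach before the first VM start,
    # typically paired with managed='no' in the domain XML
    virsh nodedev-detach pci_0000_00_02_0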

As Laine mentioned, we've seen another customer issue where a certain
NIC is left in an inconsistent state, sometimes, when returned to the
host.  They have absolutely no use for this NIC on the host, so this
was mostly a pointless operation anyway.  In this case we had to use a
pci-stub.ids option to prevent the host NIC driver from touching the
devices since there was really no easy way to set managed='no' and
pre-bind the devices to vfio-pci in their ovirt/openstack environment.
NICs usually fare better in repeated attach/detach scenarios thanks to
physical hotplug support, but it's really a question of how robust the
driver is.  For instance, how many people are out there hotplugging
$10 Realtek NICs vs multi-hundred dollar enterprise class NICs?  Has
anyone ever done physical hotplug of a graphics card, sound card, or
USB controller?
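
For reference, "set managed='no' and pre-bind the devices to vfio-pci"
amounts to something like the following on the host (the vendor:device
ID and the 0000:02:00.0 address are placeholders):

    # keep the host NIC driver off the device entirely (kernel command line)
    pci-stub.ids=<vendor>:<device>

    # or hand the device to vfio-pci before the guest ever starts
    # (driver_override needs a reasonably recent kernel; skip the unbind
    # if no driver is currently bound)
    echo vfio-pci     > /sys/bus/pci/devices/0000:02:00.0/driver_override
    echo 0000:02:00.0 > /sys/bus/pci/devices/0000:02:00.0/driver/unbind
    echo 0000:02:00.0 > /sys/bus/pci/drivers_probe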

Even in the scenario I mention above where the user thinks they want to
bounce their audio device back and forth between VM and host, there's
actually a fixed-size array of ALSA cards in the kernel, and unbinding a
device just leaks that slot. :-\

We also have nvidia.ko, which not only messes with the device to the
point where it may or may not work in the VM, but also comes from a
vendor that doesn't support dynamically unbinding devices.  The driver
will still unbind, but it kinda forgets to tell Xorg to stop using the
device.  We generally recommend folks doing GPU assignment to avoid the
host driver altogether; nouveau and radeon sometimes don't even like to
do the unbind.  i915 is actually doing better than average in this case, and the
fact that it's typically the primary graphics sort of breaks that rule
anyway.

So we have all sorts of driver issues that are sure to come and go over
time and all sorts of use cases that seem difficult to predict.  If we
know we're in an ovirt/openstack environment, managed='detach' might
actually be a more typical use case than managed='yes'.  It still
leaves us hoping that the host driver doesn't do anything bad when it
initializes the device and that it releases the device cleanly,
but it's probably better than tempting fate by unnecessarily bouncing
it back and forth between drivers. Thanks,

Alex



