[libvirt] [PATCH 4/6] conf: new <hotplug require='yes|no'/> element for hotpluggable devices

Daniel P. Berrange berrange at redhat.com
Wed Aug 10 13:48:14 UTC 2016


On Tue, Aug 09, 2016 at 02:14:15PM -0400, Laine Stump wrote:
> On 08/09/2016 04:05 AM, Daniel P. Berrange wrote:
> > On Mon, Aug 08, 2016 at 12:41:48PM -0400, Laine Stump wrote:
> > > On 08/08/2016 04:56 AM, Laine Stump wrote:
> > > > When faced with a guest device that requires a PCI address but doesn't
> > > > have one manually assigned in the config, libvirt has always insisted
> > > > (well... *tried* to insist) on auto-assigning an address that is on a
> > > > PCI controller that supports hotplug. One big problem with this is
> > > > that it prevents automatic use of a Q35 (or aarch64/virt) machine's
> > > > pcie-root (since the PCIe root complex doesn't support hotplug).
> > > > 
> > > > In order to promote simpler domain configs (more devices on pcie-root
> > > > rather than on a pci-bridge), this patch adds a new sub-element to all
> > > > guest devices that have a PCI address and support hotplug:
> > > > 
> > > >     <hotplug require='no'/>
> > > > 
> > > > For devices that have hotplug require='no', we turn off the
> > > > VIR_PCI_CONNECT_HOTPLUGGABLE bit in the devFlags when searching for an
> > > > available PCI address. Since pcie-root now allows standard PCI
> > > > devices, this results in those devices being placed on pcie-root
> > > > rather than pci-bridge.
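> > > > 
> > > > As a purely illustrative sketch (any hotpluggable device type that takes
> > > > a PCI address would work the same way), a virtio-net interface that opts
> > > > out of hotplug - and so becomes eligible for auto-placement directly on
> > > > pcie-root - could look like:
> > > > 
> > > >     <interface type='network'>
> > > >       <source network='default'/>
> > > >       <model type='virtio'/>
> > > >       <hotplug require='no'/>
> > > >     </interface>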
> > > I've been playing around with this and, by itself, it works very well. With
> > > this solved, combined with taking advantage of PCIe for virtio when
> > > available, it's very easy to create q35 domains that have no legacy-PCI
> > > without needing to resort to manually assigning addresses.
> > > 
> > > However, there is still another item that we need to be able to configure -
> > > stating a preference for legacy PCI vs. PCIe when both are available for a
> > > device (again, the aim is to do this *without* needing to manually assign an
> > > address). The following devices have this choice:
> > > 
> > > 1) vfio assigned devices
> > > 2) virtio devices
> > > 3) the nec-xhci USB controller
> > > 
> > > You might think that it would always be preferable to use PCIe if it's
> > > available, but especially in these "early days" of using PCIe in guests it
> > > would be useful to have the ability to *easily* force use of a legacy PCI
> > > slot in case some PCIe-related bug is encountered (in particular, people
> > > have pointed out in discussions about vfio device assignment that it could
> > > be possible for a guest OS to misbehave when presented with a device's PCIe
> > > configuration block (which hasn't been visible in the past because the
> > > device was attached to a legacy PCI slot)).
> > > 
> > > In order to maintain functionality while any such bugs are figured out and
> > > fixed, we need to be able to force the device onto a PCI slot. There are two
> > > ways of doing this:
> > > 
> > > 1) manually specify the full PCI address of a legacy PCI slot in the config
> > > 2) provide an option in the config that simply says "use any PCI slot" or
> > > "use any PCIe slot".
> > > Assuming that (1) is too cumbersome, we need to come up with a reasonable
> > > name/location for a config option (providing the backend for it will be
> > > trivial). Some possible places:
> > I prefer a variant on (1), which is to specify an address with only the
> > controller index filled out. eg given a q35 bridge topology
> > 
> >      <controller type='pci' index='0' model='pcie-root'/>
> >      <controller type='pci' index='1' model='dmi-to-pci-bridge'>
> >        <model name='i82801b11-bridge'/>
> >      </controller>
> >      <controller type='pci' index='2' model='pci-bridge'>
> >        <model name='pci-bridge'/>
> >        <target chassisNr='56'/>
> >      </controller>
> > 
> > A device would use
> > 
> >     <address type="pci" controller="2">   (for pci-bridge placement)
> >     <address type="pci" controller="0">   (for pcie-root placement)
> > 
> > 
> > This trivially expands to cover the NUMA use case too, without having to
> > invent further elements duplicating NUMA node info again.
> > 
> > 
> >      <controller type='pci' index='0' model='pci-root'/>
> >      <controller type='pci' index='1' model='pci-expander-bus'>
> >        <model name='pxb'/>
> >        <target busNr='254'>
> >          <node>0</node>
> >        </target>
> >      </controller>
> >      <controller type='pci' index='2' model='pci-expander-bus'>
> >        <model name='pxb'/>
> >        <target busNr='255'>
> >          <node>1</node>
> >        </target>
> >      </controller>
> > 
> > eg device uses
> > 
> >     <address type="pci" controller="1">   (for pxb on NUMA node 0)
> >     <address type="pci" controller="2">   (for pxb on NUMA node 1)
> > 
> > 
> > Yes, when you first boot a guest, this means the mgmt app has to know
> > what controllers exist by default, and/or specify controllers,
> 
> It also has to know the rules about which controllers support what type of
> devices, and which controllers can plug into which other controllers - the
> full topology of PCI controllers will need to be spelled out specifically in
> advance, so that the management app can give the index number of the correct
> controller.
> 
> As it stands now, some amount of this is already necessary, just because
> libvirt currently will only automatically add pci-bridge devices when there
> aren't enough PCI slots available (the code to automatically add pcie
> controllers as necessary is yet to be written). Still, this is all it takes
> to get 8 properly connected downstream-ports:
> 
> <controller type='pci' model='pcie-root-port'/>
> <controller type='pci' model='pcie-switch-upstream-port'/>
> <controller type='pci' model='pcie-switch-downstream-port'/>
> <controller type='pci' model='pcie-switch-downstream-port'/>
> <controller type='pci' model='pcie-switch-downstream-port'/>
> <controller type='pci' model='pcie-switch-downstream-port'/>
> <controller type='pci' model='pcie-switch-downstream-port'/>
> <controller type='pci' model='pcie-switch-downstream-port'/>
> <controller type='pci' model='pcie-switch-downstream-port'/>
> <controller type='pci' model='pcie-switch-downstream-port'/>
> 
> (note that you don't need to say where they are connected, nor do you need
> to give an index to each one. As long as there is an available slot of the
> correct type for each controller as it's encountered, libvirt will hook them
> up and index them correctly)
> 
> If the controller number needed to be specified in the device address, then
> not only would each controller need to be manually given an index (if you
> were specifying the entire domain in a single go), but each device would
> need to be given a unique setting - instead of just saying "I want one of
> these" (in the future, if there isn't "one of those" available, everything
> necessary to provide it will be automatically created), you have to say "I
> want *this* one", implying that you know everything about what "this one"
> provides (and that you will need to manually set up "this one" and everything
> else required to provide it).
> 
> Also, note that since each pcie-root-port and pcie-switch-downstream-port has
> only a single slot, in those cases this:
> 
>    <address type="pci" controller="1"/>
> 
> is really just:
> 
>    <address type="pci" bus="1"/>
> 
> and for the other cases (pci[e]-root, pci-bridge, dmi-to-pci-bridge) you
> will still have to keep track of how many devices you've assigned to a
> particular controller, to make sure that your request doesn't overflow the
> 31 (or 32) device limit (which brings up another problem of doing it this
> way - different controllers have different numbers of useful slots, so the
> management app will have to know about that too).
> 
> >   but I
> > think that's ultimately preferable to inventing an indefinitely
> > growing list of extra XML elements to provide tweaks to the PCI address
> > assignment logic.
> > 
> > We can simplify the life of mgmt apps in a different way, by using the
> > domain capabilities XML to provide full data on the default controllers
> > used for each machine type.
> 
> And it would also need to provide pcie/pci info on every endpoint device (I
> agree with your assertion that we should treat every device as potentially
> capable of hotplug, so we don't need that info for endpoints).
> 
> >   So apps would not need to have any machine type specific
> > logic in them - they can write code that's entirely metadata driven based
> > on the domain capabilities.
> 
> The metadata would need to include:
> 
> 1) upstream connection type (each type of PCI controller needs a different
>    connection type. I've learned through all this that each controller has
>    different rules about what it can be connected to. So far there are 9
>    different types, including pci-endpoint and pcie-endpoint)
> 2) accepted downstream connection types
> 3) min and max slot (currently known possibilities: 0-0, 0-31, or 0-32)
> 4) hotplug supported or not (yes, the management app really would need
>    to know about this, even if they intended for all devices to be
>    hotpluggable - they need to be able to avoid pcie-root and
>    dmi-to-pci-bridge)
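> 
> Purely as an illustration of those four items (the element names and values
> below are invented and incomplete - nothing like this exists in the domain
> capabilities output today), the per-controller metadata might look roughly
> like:
> 
>    <pcicontrollers>
>      <controller model='pcie-root'>
>        <upstream type='none'/>
>        <downstream types='pcie-root-port dmi-to-pci-bridge pcie-endpoint pci-endpoint'/>
>        <slots min='0' max='31'/>
>        <hotplug supported='no'/>
>      </controller>
>      <controller model='pcie-root-port'>
>        <upstream type='pcie-root'/>
>        <downstream types='pcie-endpoint pcie-switch-upstream-port'/>
>        <slots min='0' max='0'/>
>        <hotplug supported='yes'/>
>      </controller>
>      ... (one entry per controller model)
>    </pcicontrollers>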
> 
> Then the management app would need to keep track of which addresses are in
> use for each controller, so that it would be able to avoid putting too many
> devices on any controller.
> 
> Also, the management app will now need to know for *every* device whether
> that device in this particular version of qemu is a PCIe device, a legacy
> PCI device, or one of the silly "dual mode" devices that change according
> to the type of bus it's plugged into. Otherwise it won't know which
> controller it should plug the device into.
> 
> Finally, the management app would need to add logic to grow the bus topology
> correctly as needed. Sometimes this is simple and sometimes not. For
> example, if there is an open slot available on pcie-root, you can get
> another hotpluggable pcie slot by simply adding a pcie-root-port. If not,
> then you need to find an open pcie-switch-downstream-port or pcie-root-port,
> then connect a pcie-switch-upstream-port to that, and connect one or more
> pcie-switch-downstream-ports to that. (Note this implies that you should
> always maintain at least one unoccupied root-port or downstream-port, or an
> open slot on pcie-root.)
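> 
> Concretely, the expansion step in that second case amounts to appending
> something like the following (whichever component ends up generating it),
> with the new upstream port getting hooked into an unoccupied root-port or
> downstream-port just as in the earlier example:
> 
>    <controller type='pci' model='pcie-switch-upstream-port'/>
>    <controller type='pci' model='pcie-switch-downstream-port'/>
>    <controller type='pci' model='pcie-switch-downstream-port'/>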
> 
> 
> The whole point of my exercise has been to 1) get rid of as much legacy PCI
> baggage as possible, and 2) do this in a way that prevents/eliminates the
> need for multiple management apps to re-implement and maintain essentially
> the same code (especially since my trip through PCIe-land has shown me that
> it is fraught with lack of documentation (about the controller
> implementations, not the PCIe spec), misinformation (changing over time),
> misconceptions, and lots of annoying little details). I started out kind of
> liking your alternate idea, since it seemed to accomplish that in a simpler
> way that didn't require adding yet another knob later. But after thinking it
> through, I think that it provides very little extra functionality, and would
> require a lot of extra work for the management app, even in a simple case
> where you didn't care about PCI vs PCIe.
> 
> 
> In the end, my opinion is that the job of creating a proper PCIe topology
> and assigning devices to the correct address in that topology is a
> non-trivial job that should be implemented in one place. I think libvirt is
> in the proper place to do that, it just needs to be provided with a small
> amount of information from the user/management app so that libvirt won't be
> making any policy decisions, but just ensuring that those policies are
> properly applied.

Ultimately what you are suggesting here is to stuff a massive amount of
VM configuration policy logic inside libvirt and have a bunch of extra XML
elements which control how that policy works. When some application
finds some aspect of the policy that doesn't work for it, it'll have
to solve all of the hard problems itself anyway, or live with a policy
that doesn't work right for it. Over time we'll end up adding countless
new XML knobs to libvirt whose sole purpose is to be inputs to ever more
complex policy inside libvirt.

We've had a long-standing rule that libvirt avoids implementing policy
rules wherever possible, because it is inevitable that no single policy
will satisfy all downstream consumers of libvirt. IMHO this rule has been
one of our best decisions and so I'm not in favour of abandoning it
now.

While we do have some existing PCI addressing logic, this was unavoidable
when we first added it, due to the need to move address assignment out of
QEMU and into libvirt while maintaining back-compat for existing apps
using libvirt. This series, however, is talking about adding logic for cases
where we have no corresponding logic in QEMU and no need for back-compat, so
there is not a compelling reason why this has to live in libvirt.

So I'm a strong -1 on the entire approach in this patch series.

I think it needs to be re-worked so that we can expose the necessary
information in domain capabilities to allow applications using libvirt
to implement their own policies.  This would also allow us to create
a shared library outside of libvirt which implements some "standard"
policies that apps can share if they wish - eg this is
the role libvirt-designer was intended to fill: provide an opinionated
policy driver API.  The attractive thing about using libvirt-designer
or an equivalent API layer outside libvirt is that all the policy
control knobs will exist as actual APIs instead of extra XML elements.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
