[libvirt] [PATCH 4/6] conf: new <hotplug require='yes|no'/> element for hotpluggable devices

Tue Aug 9 18:14:15 UTC 2016

On 08/09/2016 04:05 AM, Daniel P. Berrange wrote:
> On Mon, Aug 08, 2016 at 12:41:48PM -0400, Laine Stump wrote:
>> On 08/08/2016 04:56 AM, Laine Stump wrote:
>>> When faced with a guest device that requires a PCI address but doesn't
>>> have one manually assigned in the config, libvirt has always insisted
>>> (well... *tried* to insist) on auto-assigning an address that is on a
>>> PCI controller that supports hotplug. One big problem with this is
>>> that it prevents automatic use of a Q35 (or aarch64/virt) machine's
>>> pcie-root (since the PCIe root complex doesn't support hotplug).
>>>
>>> In order to promote simpler domain configs (more devices on pcie-root
>>> rather than on a pci-bridge), this patch adds a new sub-element to all
>>> guest devices that have a PCI address and support hotplug:
>>>
>>>     <hotplug require='no'/>
>>>
>>> For devices that have hotplug require='no', we turn off the
>>> VIR_PCI_CONNECT_HOGPLUGGABLE bit in the devFlags when searching for an
>>> available PCI address. Since pcie-root now allows standard PCI
>>> devices, this results in those devices being placed on pcie-root
>>> rather than pci-bridge.
>> I've been playing around with this and, by itself, it works very well. With
>> this solved, combined with taking advantage of PCIe for virtio when
>> available, it's very easy to create q35 domains that have no legacy-PCI
>> without needing to resort to manually assigning addresses.
>>
>> However, there is still another item that we need to be able to configure -
>> stating a preference of legacy PCI vs. PCIe when both are available for a
>> device (again, the aim is to do this *without* needing to manually assign an
>> address). The following devices have this choice:
>>
>> 1) vfio assigned devices
>> 2) virtio devices
>> 3) the nec-xhci USB controller
>>
>> You might think that it would always be preferable to use PCIe if it's
>> available, but especially in these "early days" of using PCIe in guests it
>> would be useful to have to ability to *easily* force use of a legacy PCI
>> slot in case some PCIe-related bug is encountered (in particular, people
>> have pointed out in discussions about vfio device assignment that it could
>> be possible for a guest OS to misbehave when presented with a device's PCIe
>> configuration block (which hasn't been visible in the past because the
>> device was attached to a legacy PCI slot)).
>>
>> In order maintain functionality while any such bugs are figured out and
>> fixed, we need to be able to force the device onto a PCI slot. There are two
>> ways of doing this:
>>
>> 1) manually specify the full PCI address of a legacy PCI slot in the config
>> 2) provide an option in the config that simply says "use any PCI slot" or
>> "use any PCIe slot".
>> Assuming that (1) is too cumbersome, we need to come up with a reasonable
>> name/location for a config option (providing the backend for it will be
>> trivial). Some possible places:
> I prefer a variant on (1), which is to specifcy an address with only the
> controller index filled out. eg given a q35 bridge topology
>
>      <controller type='pci' index='0' model='pcie-root'/>
>      <controller type='pci' index='1' model='dmi-to-pci-bridge'>
>        <model name='i82801b11-bridge'/>
>      </controller>
>      <controller type='pci' index='2' model='pci-bridge'>
>        <model name='pci-bridge'/>
>        <target chassisNr='56'/>
>      </controller>
>
> A device would use
>
>     <address type="pci" controller="2">   (for pci-bridge placement)
>     <address type="pci" controller="0">   (for pcie-root placement)
>
>
> This trivially expaneds to cover the NUMA use case too, without having to
> invent further elements duplicating NUMA node info again.
>
>
>      <controller type='pci' index='0' model='pci-root'/>
>      <controller type='pci' index='1' model='pci-expander-bus'>
>        <model name='pxb'/>
>        <target busNr='254'>
>          <node>0</node>
>        </target>
>      </controller>
>      <controller type='pci' index='2' model='pci-expander-bus'>
>        <model name='pxb'/>
>        <target busNr='255'>
>          <node>1</node>
>        </target>
>      </controller>
>
> eg device uses
>
>     <address type="pci" controller="1">   (for pxb on NUMA node 0)
>     <address type="pci" controller="2">   (for pxb on NUMA node 1)
>
>
> Yes, when you first boot a guest, this means the mgmt app has to know
> what controllers exist by default, and/or specify controllers,

It also has to know the rules about which controllers support what type 
of devices, and which controllers can plug into which other controllers 
- the full topology of PCI controllers will need to be spelled out 
specifically in advance, so that the management app can give the index 
number of the correct controller.

As it stands now, some amount of this is already necessary, just because 
libvirt currently will only automatically add pci-bridge devices when 
there aren't enough PCI slots available (the code to automatically add 
pcie controllers as necessary is yet to be written). Still, this is all 
it takes to get 8 properly connected downstream-ports:

<controller type='pci' model='pcie-root-port'/>
<controller type='pci' model='pcie-switch-upstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>
<controller type='pci' model='pcie-switch-downstream-port'/>

(note that you don't need to say where they are connected, nor do you 
need to give an index to each one. As long as there is an available slot 
of the correct type for each controller as it's encountered, libvirt 
will hook them up and index them correctly)

If the controller number was needed to specify in the device address, 
then not only would each controller need to be manually given an index 
(if you were specifying the entire domain in a single go), but each 
device would need to be given a unique setting - instead of just saying 
"I want one of these" (in the future, if there isn't "one of those" 
available, everything necessary to provide it will be automatically 
created), you have to say "I want *this* one, implying that you know 
everything about what "this one" provides (and you will need to manually 
setup "this one" and everything else required to provide it).

Also, note that since each pci-root-port and pci-switch-downstream-port 
has only a single slot, in those cases this:

    <address type="pci" controller="1"/>

is really just:

    <address type="pci" bus="1"/>

and for the other cases (pci[e]-root, pci-bridge, dmi-to-pci-bridge) you 
will still have to keep track of how many devices you've assigned to a 
particular controller, to make sure that your request doesn't overflow 
the 31 (or 32) device limit (which brings up another problem of doing it 
this way - different controllers have different numbers of useful slots, 
so the management app will have to know about that too).

>   but I
> think that's ultimately preferrable than inventing an indefinitely
> growing list of extra XML elements to provide tweaks to teh PCI address
> assginment logic.
>
> We can simplify life of mgmt apps in a different way, but using the domain
> XML capabilities to provide full data on the default controllers used for
> each machine type.

And pcie/pci info on every endpoint device (I agree with your assertion 
that we should treat every device as potentially capable of hotplug, so 
we don't need that info for endpoints).

>   So apps would not need to have any machine type specific
> logic in them - they can write code that's entirely metadata driven based
> on the domain capabilities.

The metadata would need to include:

1) upstream connection type (each type of PCI controller needs a different
     connection type. I've learned through all this that each controller 
has different
     rules about what it can be connected to. So far there are 9 
different types,
     including pci-endpoint and pcie-endpoint)
2) accepted downstream connection types
3) min and max slot (currently known possibilities: 0-0, 0-31, or 0-32)
4) hotplug supported or not (yes, the management app really would need
      to know about this, even if they intended for all devices to be 
hotpluggable
      - they need to be able to avoid pcie-root and dmi-to-pci-bridge)

Then the management app would need to keep track of which addresses are 
in use for each controller, so that it would be able to avoid putting 
too many devices on any controller.

Also, the management app will now need to know for *every* device 
whether that device in this particular version of qemu is a PCIe device, 
a legacy PCI device, or one of the silly "dual mode" devices that 
changes according to the type of bus it's plugged into. Otherwise it 
won't know which controller it should plug the device into.

Finally, the management app would need to add logic to grow the bus 
topology correctly as needed. Sometimes this is simple and sometimes 
not. For example, if there is an open slot available on pcie-root, you 
can get another hotpluggable pcie slot by simply adding a 
pcie-root-port. If not, then you need to find an open 
pcie-switch-downstream-port or pcie-root-port, then connect a 
pcie-switch-upstream-port to that, and connect one or more 
pcie-switch-downstream-ports to that. (note this implies that you should 
always maintain at least one unoccupied root-port, downstream-port, or 
an open slot on pcie-root).

The whole point of my exercise has been to 1) get rid of as much legacy 
PCI baggage as possible, and 2) do this in a way that 
prevents/eliminates the need for multiple management apps to 
re-implement and maintain essentially the same code (especially since my 
trip through PCIe-land has shown me that it is fraught with lack of 
documentation (about the controller implementations, not the PCIe spec), 
misinformation (changing over time), misconceptions, and lots of 
annoying little details). I started out kind of liking your alternate 
idea, since it seemed to accomplish that in a simpler way that didn't 
require adding yet another knob later. But after thinking it through, I 
think that it provides very little extra functionality, and would 
require a lot of extra work for the management app, even in a simple 
case where you didn't care about PCI vs PCIe.

In the end, my opinion is that the job of creating a proper PCIe 
topology and assigning devices to the correct address in that topology 
is a non-trivial job that should be implemented in one place. I think 
libvirt is in the proper place to do that, it just needs to be provided 
with a small amount of information from the user/management app so that 
libvirt won't be making any policy decisions, but just assuring that 
those policies are properly applied.