[libvirt] new libvirt "pci" controller type and pcie/q35 (was Re: [PATCH 4/7] add pci-bridge controller type)

Laine Stump laine at laine.org
Tue Apr 16 16:05:40 UTC 2013


On 04/15/2013 05:58 PM, Michael S. Tsirkin wrote:
> On Mon, Apr 15, 2013 at 11:27:03AM -0600, Alex Williamson wrote:
>> On Fri, 2013-04-12 at 11:46 -0400, Laine Stump wrote:
>>> On 04/11/2013 07:23 AM, Michael S. Tsirkin wrote:
>>>> On Thu, Apr 11, 2013 at 07:03:56AM -0400, Laine Stump wrote:
>>>>> On 04/10/2013 05:26 AM, Daniel P. Berrange wrote:
>>>>>> On Tue, Apr 09, 2013 at 04:06:06PM -0400, Laine Stump wrote:
>>>>>>> On 04/09/2013 04:58 AM, Daniel P. Berrange wrote:
>>>>>>>> On Mon, Apr 08, 2013 at 03:32:07PM -0400, Laine Stump wrote:
>>>>>>>> Actually I do wonder if we should represent a PCI root as two
>>>>>>>> <controller> elements, one representing the actual PCI root
>>>>>>>> device, and the other representing the host bridge that is
>>>>>>>> built-in.
>>>>>>>>
>>>>>>>> Also we should use the actual model names, not 'pci-root' or
>>>>>>>> 'pcie-root' but rather i440FX for "pc" machine type, and whatever
>>>>>>>> the q35 model name is.
>>>>>>>>
>>>>>>>>  - One PCI root with built-in PCI bus (ie today's setup)
>>>>>>>>
>>>>>>>>    <controller type="pci-root" index="0">
>>>>>>>>      <model name="i440FX"/>
>>>>>>>>    </controller>
>>>>>>>>    <controller type="pci" index="0"> <!-- Host bridge -->
>>>>>>>>      <address type='pci' domain='0' bus='0' slot='0'/>
>>>>>>> Isn't this saying that the bridge connects to itself? (since bus 0 is
>>>>>>> this bus)
>>>>>>>
>>>>>>> I understand (again, possibly wrongly) that the builtin PCI bus connects
>>>>>>> to the chipset using its own slot 0 (that's why it's reserved), but
>>>>>>> that's its address on itself. How is this bridge associated with the
>>>>>>> pci-root?
>>>>>>>
>>>>>>> Ah, I *think* I see it - the domain attribute of the pci controller is
>>>>>>> matched to the index of the pci-root controller, correct? But there's
>>>>>>> still something strange about the <address> of the pci controller being
>>>>>>> self-referential.
>>>>>> Yes, the index of the pci-root matches the 'domain' of <address>
>>>>> Okay, then the way that libvirt differentiates between a pci bridge that
>>>>> is connected to the root, and one that is connected to a slot of another
>>>>> bridge is 1) the "bus" attribute of the bridge's <address> matches the
>>>>> "index" attribute of the bridge itself, and 2) "slot" is always 0. Correct?
>>>>>
>>>>> (The corollary of this is that if slot == 0 and bus != index, or bus ==
>>>>> index and slot != 0, it is a configuration error).
>>>>>
>>>>> I'm still unclear on the usefulness of the pci-root controller though -
>>>>> all the necessary information is contained in the pci controller, except
>>>>> for the type of root. But in the case of pcie root, I think you're not
>>>>> allowed to connect a standard bridge to it, only a "dmi-to-pci-bridge"
>>>>> (i82801b11-bridge)
>>>> Yes you can connect a pci bridge to pcie-root.
>>>> It's represented as a root complex integrated device.
>> Is this accurate?  Per the PCI express spec, any PCI express device
>> needs to have a PCI express capability, which our pci-bridge does not.
>> I think this is one of the main differences for our i82801b11-bridge,
>> that it exposes itself as a root complex integrated endpoint, so we know
>> it's effectively a PCIe-to-PCI bridge.
> If it does not have an express link upstream it's not a
> PCIe-to-PCI bridge, is it?


To my untrained ear it sounds like you're disagreeing with yourself???


>>  We'll be asking for trouble
>> if/when we get guest IOMMU support if we are lax about using PCI-to-PCI
>> bridges where we should have PCIe-to-PCI bridges.
> I recall the spec saying somewhere that integrated endpoints are outside
> the root complex hierarchy.  I think IOMMU will simply not apply to
> these.


Correct me if I'm wrong - I think libvirt can ignore this bit of debate
other than to use its result to determine which devices are allowed to
connect to which other devices, right?



>> There are plenty of
>> examples to the contrary of root complex integrated endpoints without an
>> express capability, but that doesn't make it correct to the spec.
> Is there something in the spec explicitly forbidding this?  I merely
> find: The PCI Express Capability structure is required for PCI Express
> device Functions.
> So if it's not an express device it does not have to have
> an express capability?
>
> Maybe we should send an example dump to pci sig and ask them...
>
>>> ARGHH!! Just when I think I'm starting to understand *something* about
>>> these devices...
>>>
>>> (later edit: after some coaching on IRC, I *think* I've got a bit better
>>> handle on it.)


(But I guess not good enough :-P)


>>>
>>>>>>>>    </controller>
>>>>>>>>    <interface type='direct'>
>>>>>>>>       ...
>>>>>>>>      <address type='pci' domain='0' bus='0' slot='3'/>
>>>>>>>>    </interface>
>>>>>>>>
>>>>>>>>  - One PCI root with built-in PCI bus and extra PCI bridge
>>>>>>>>
>>>>>>>>    <controller type="pci-root" index="0">
>>>>>>>>      <model name="i440FX"/>
>>>>>>>>    </controller>
>>>>>>>>    <controller type="pci" index="0"> <!-- Host bridge -->
>>>>>>>>      <address type='pci' domain='0' bus='0' slot='0'/>
>>>>>>>>    </controller>
>>>>>>>>    <controller type="pci" index="1"> <!-- Additional bridge -->
>>>>>>>>      <address type='pci' domain='0' bus='0' slot='1'/>
>>>>>>>>    </controller>
>>>>>>>>    <interface type='direct'>
>>>>>>>>       ...
>>>>>>>>      <address type='pci' domain='0' bus='1' slot='3'/>
>>>>>>>>    </interface>
>>>>>>>>
>>>>>>>>  - One PCI root with built-in PCI bus, PCI-E bus and an extra PCI bridge
>>>>>>>>    (ie possible q35 setup)
>>>>>>> Why would a q35 machine have an i440FX pci-root?
>>>>>> It shouldn't, that's a typo
>>>>>>
>>>>>>>>    <controller type="pci-root" index="0">
>>>>>>>>      <model name="i440FX"/>
>>>>>>>>    </controller>
>>>>>>>>    <controller type="pci" index="0"> <!-- Host bridge -->
>>>>>>>>      <address type='pci' domain='0' bus='0' slot='0'/>
>>>>>>>>    </controller>
>>>>>>>>    <controller type="pci" index="1"> <!-- Additional bridge -->
>>>>>>>>      <address type='pci' domain='0' bus='0' slot='1'/>
>>>>>>>>    </controller>
>>>>>>>>    <controller type="pci" index="1"> <!-- Additional bridge -->
>>>>>>>>      <address type='pci' domain='0' bus='0' slot='1'/>
>>>>>>>>    </controller>
>>>>>>> I think you did a cut-paste here and intended to change something, but
>>>>>>> didn't - those two bridges are identical.
>>>>>> Yep, the slot should be 2 in the second one
>>>>>>
>>>>>>>>    <interface type='direct'>
>>>>>>>>       ...
>>>>>>>>      <address type='pci' domain='0' bus='1' slot='3'/>
>>>>>>>>    </interface>
>>>>>>>>
>>>>>>>> So if we later allowed for multiple PCI roots, then we'd have something
>>>>>>>> like
>>>>>>>>
>>>>>>>>    <controller type="pci-root" index="0">
>>>>>>>>      <model name="i440FX"/>
>>>>>>>>    </controller>
>>>>>>>>    <controller type="pci-root" index="1">
>>>>>>>>      <model name="i440FX"/>
>>>>>>>>    </controller>
>>>>>>>>    <controller type="pci" index="0"> <!-- Host bridge 1 -->
>>>>>>>>      <address type='pci' domain='0' bus='0' slot='0'/>
>>>>>>>>    </controller>
>>>>>>>>    <controller type="pci" index="0"> <!-- Host bridge 2 -->
>>>>>>>>      <address type='pci' domain='1' bus='0' slot='0'/>
>>>>>>>>    </controller>
>>>
>>> There is a problem here - within a given controller type, we will now
>>> have the possibility of multiple controllers with the same index - the
>>> differentiating attribute will be in the <address> subelement, which
>>> could create some awkwardness. Maybe instead this should be handled with
>>> a different model of pci controller, and we can add a "domain" attribute
>>> at the toplevel rather than specifying an <address>?
>> On real hardware, the platform can specify the _BBN (Base Bus Number =
>> bus) and the _SEG (Segment = domain) of the host bridge.  So perhaps you
>> want something like:
>>
>> <controller type="pci-host-bridge">
>>   <model name="i440FX"/>
>>   <address type="pci-host-bridge-addr" domain='1' bus='0'/>
>> </controller>


The <address> element is intended to specify where a device or
controller is connected *to*, not what bus/domain it *provides*. I think
you're intending for this to provide domain 1 bus 0, so according to
existing convention, you would want that information in the <controller>
element attributes (e.g. for all other controller types, the generic
"index" attribute is used to indicate a bus number when such a thing is
appropriate for that type of controller).

Anyway, I've simplified this a bit in my latest iteration - there are no
separate "root" and "root bus" controllers, just a "pci-root" (for
i440FX) or "pcie-root" (for q35), both of which provide a "pci" bus (I'm
using the term loosely here), each with different restrictions about
what can be connected.
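
To make that concrete, here is a minimal sketch of what that simplified
XML might look like for an i440FX machine (the index and slot values
here are just illustrative):

    <controller type='pci' index='0'>
      <model type='pci-root'/> <!-- 'pcie-root' on a q35 machine -->
    </controller>
    <controller type='pci' index='1'>
      <model type='pci-bridge'/>
      <address type='pci' domain='0' bus='0' slot='1' function='0'/>
    </controller>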


> Yes, we could specify segments, though it's not the same as
> a domain as linux guests define it (I assume this is what libvirt wants
> to call a domain): if memory serves a segment does not have to be a root
> based hierarchy, linux domains are all root based.


I'm not exactly sure of the meanings/implications of all those terms,
but from the point of view of libvirt, as long as we can represent all
possible connections between devices using the domain:bus:slot.function
notation, I think it doesn't matter too much.


> We are better off not specifying BBN for all buses I think -


How would you differentiate between the different buses without some
sort of identifier?


> it's intended for multi-root support for legacy OSes.
>
>> "index" is confusing to me.


index is being used just because that's been the convention for other
controller types - when there are multiple controllers of the same type,
each is given an index, and that's used in the "child" devices to
indicate which of the parent controllers they connect to.
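
For comparison, here is how that convention already works for an
existing controller type - a SCSI controller and a disk that connects
to it (current libvirt syntax):

    <controller type='scsi' index='0' model='virtio-scsi'/>
    <disk type='file' device='disk'>
      ...
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>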


> I'd prefer ID for bus not a number, I'm concerned users will
> assume it's bus number and get confused by a mismatch.

So you would rather that they were something like this?

<controller type='pci' bus='pci.0'>
  <model type='pci-root'/>
</controller>
<interface type='blah'>
  ...
  <address type='pci' domain='0' bus='pci.0' slot='0' function='0'/>
</interface>

The problem is that the use of numeric bus IDs is fairly deeply
ingrained in libvirt; every existing libvirt guest config has device
addresses specifying "bus='0'". Switching to an alphanumeric ID rather
than a simple number would require extra care to maintain backward
compatibility with all those existing configs, and with previous
versions of libvirt that might end up being the recipients of XML
generated by a newer libvirt. Because of this, at the very least the
pci.0 bus must be referred to as bus='0'; once we've done that, we
might as well refer to them *all* numerically (even if names were
allowed, I'm sure everybody would just call them '1', '2', or at the
very most "pci.1", "pci.2", anyway).


>>>>>>>>    <interface type='direct'> <!-- NIC on host bridge 2 -->
>>>>>>>>       ...
>>>>>>>>      <address type='pci' domain='1' bus='0' slot='3'/>
>>>>>>>>    </interface>
>>>>>>>>
>>>>>>>>
>>>>>>>> NB this means that 'index' values can be reused against the
>>>>>>>> <controller>, provided they are setup on different pci-roots.
>>>>>>>>
>>>>>>>>> (also note that it might happen that the bus number in libvirt's config
>>>>>>>>> will correspond to the bus numbering that shows up in the guest OS, but
>>>>>>>>> that will just be a happy coincidence)
>>>>>>>>>
>>>>>>>>> Does this make sense?
>>>>>>>> Yep, I think we're fairly close.
>>>>>>> What about the other types of pci controllers that are used by PCIe? We
>>>>>>> should make sure they fit in this model before we settle on it.
>>>>>> What do they do ?
>>> (The descriptions of different models below tell what each of these
>>> other devices does; in short, they're all just some sort of electronic
>>> Lego to help connect PCI and PCIe devices into a tree).
>>>
>>> Okay, I'll make yet another attempt at understanding these devices, and
>>> suggesting how they can all be described in the XML. I'm thinking that
>>> *all* of the express hubs, switch ports, bridges, etc can be described
>>> in xml in the manner above, i.e.:
>>>
>>>    <controller type='pci' index='n'>
>>>      <model type='xxx'/>
>>>    </controller>
>>>
>>> and that the method for connecting a device to any of them would be by
>>> specifying:
>>>
>>>      <address type='pci' domain='n' bus='n' slot='n' function='n'/>
>>>
>>> Any limitations about which devices/controllers can connect to which
>>> controllers, and how many devices can connect to any particular
>>> controller will be derived from the <model type='xxx'/>. (And, as we've
>>> said before, although qemu doesn't assign each of these controllers a
>>> numeric bus id, and although we can make no guarantee that the bus id we
>>> use for a particular controller is what will be used by the guest
>>> BIOS/OS, it's still a convenient notation and works well with other
>>> hypervisors as well as qemu. I'll also note that when I run lspci on an
>>> X58-based machine I have here, *all* of the relationships between all
>>> the devices listed below are described with simple bus:slot.function
>>> numbers.)
>>>
>>> Here is a list of the pci controller model types and their restrictions
>>> (thanks to mst and aw for repeating these over and over to me; I'm sure
>>> I still have made mistakes, but at least it's getting closer).
>>>
>>>
>>> <controller type='pci-root'>
>>> ============================
>>>
>>> Upstream:         nothing
>>> Downstream:       only a single pci-root-bus (implied)
>>> qemu commandline: nothing (it's implied in the q35 machinetype)
>>>
>>> Explanation:
>>>
>>> Each machine will have a different controller called "pci-root" as
>>> outlined above by Daniel. Two types of pci-root will be supported:
>>> i440FX and q35. If a pci-root is not spelled out in the config, one will
>>> be auto-added (depending on machinetype).
>>>
>>> An i440FX pci-root has an implicitly added pci-bridge at 0:0:0.0 (and
>>> any bridge that has an address of slot='0' on its own bus is, by
>>> definition, connected to a pci-root controller - the two are matched by
>>> setting "domain" in the address of the pci-bridge to "index" of the
>>> pci-root). This bridge can only have PCI devices added.
>>>
>>> A q35 pci-root also implies a different kind of pci-bridge device - one
>>> that can only have PCIe devices/controllers attached, but is otherwise
>>> identical to the pci-bridge added for i440FX. This bus will be called
>>> "root-bus" (Note that there are generally followed conventions for what
>>> can be connected to which slot on this bus, and we will probably follow
>>> those conventions when building a machine, *but* we will not hardcode
>>> this convention into libvirt; each q35 machine will be an empty slate)
>>>
>>>
>>> <controller type='pci'>
>>> =======================
>>>
>>> This will be used for *all* of the following controller devices
>>> supported by qemu:
>>>
>>> <model type='pcie-root-bus'/> (implicit/integrated)
>>> ----------------------------
>>>
>>> Upstream:         connect to pci-root controller *only*
>>> Downstream:       32 slots, PCIe devices only, no hotplug.
>>> qemu commandline: nothing (implicit in the q35-* machinetype)
>>>
>>> This controller is the bus described above that connects to a q35's
>>> pci-root, and provides places for PCIe devices to connect. Examples are
>>> root-ports, dmi-to-pci-bridges, sata controllers, and integrated
>>> sound/usb/ethernet devices (do any of those that can be connected to the
>>> pcie-root-bus exist yet?).
>>>
>>> There is only one of these controllers, and it will *always* be
>>> index='0', and will always have the following address:
>>>
>>>   <address type='pci' domain='0' bus='0' slot='0' function='0'/>
>> Implicit devices make me nervous, why wouldn't this just be a pcie-root
>> (or pcie-host-bridge)?  If we want to support multiple host bridges,
>> there can certainly be more than one, so the index='0' assumption seems
>> to fall apart.


That's when we need to start talking about a "domain" attribute, like this:

   <controller type='pci' domain='1' index='0'>
     <model type='pcie-root-bus'/>
   </controller>


>>> <model type='root-port'/> (ioh3420)
>>> -------------------------
>>>
>>> Upstream:         PCIe, connect to pcie-root-bus *only* (?)
>> yes
>>
>>> Downstream:       1 slot, PCIe devices only (?)
>> yes
>>
>>> qemu commandline: -device ioh3420,...
>>>
>>> These can only connect to the "pcie-root-bus" of a q35 (implying that
>>> this bus will need to have a different model name than the simple
>>> "pci-bridge").
>>>
>>>
>>> <model type='dmi-to-pci-bridge'/> (i82801b11-bridge)
>> I'm worried this name is either too specific or too generic.  What
>> happens when we add a generic pcie-bridge and want to use that instead
>> of the i82801b11-bridge?  The guest really only sees this as a
>> PCIe-to-PCI bridge, it just happens that on q35 this attaches at the DMI
>> port of the MCH.


Hehe. Just using the name you (Alex) suggested :-)

My use of the "generic" device *type* names rather than exact hardware
model names is based on the idea that any given machinetype will have a
set of these "building block" devices available, and as long as you use
everything from the same "set" on a given machine, it doesn't really
matter which set you use. Is this a valid assumption?


>>
>>> ---------------------------------
>>>
>>> (btw, what does "dmi" mean?)
>> http://en.wikipedia.org/wiki/Direct_Media_Interface
>>
>>> Upstream:         pcie-root-bus *only*
>> And only to a specific q35 slot (1e.0) for the i82801b11-bridge.
>>
>>> Downstream:       32 slots, any PCI device, no hotplug (?)
>> Yet, but I think this is where we want to implement ACPI based hotplug.


Okay, but for now libvirt can just refrain from auto-addressing any
user-created devices to that bus; we'll just make sure that there is
always a "pci-bridge" plugged into it, and auto-addressed devices will
all be put there.

In the meantime if someone explicitly addresses a device to connect to
the i82801b11-bridge, we'll let them do it, but if they try to
hot-unplug it they will get an error.
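
So a q35 config would always end up containing something like the
following sketch (the index values are illustrative; slot 30 is 0x1e,
the slot Alex mentions above):

    <controller type='pci' index='1'>
      <model type='dmi-to-pci-bridge'/>
      <address type='pci' domain='0' bus='0' slot='30' function='0'/>
    </controller>
    <controller type='pci' index='2'>
      <model type='pci-bridge'/>
      <address type='pci' domain='0' bus='1' slot='1' function='0'/>
    </controller>

with auto-addressed PCI devices all landing on bus 2 (the pci-bridge).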


>>
>>> qemu commandline: -device i82801b11-bridge,...
>>>
>>>
>>> <model type='upstream-switch-port'/> (x3130-upstream)
>>> ------------------------------------
>>>
>>> Upstream:         PCIe, connect to pcie-root-bus, root-port, or
>>> downstream-switch-port (?)
>> yes
>>
>>> Downstream:       32 slots, connect *only* to downstream-switch-port
>> I can't verify that there are 32 slots, mst?  I've only set up downstream
>> ports within slot 0.


According to a discussion with Don Dutile on IRC yesterday, the
downstream side of an upstream-switch-port has 32 "slots" with 8
"functions" each, and each of these functions can have a
downstream-switch-port connected. That said, he told me that in every
case he's seen in the real world, all the downstream-switch-ports were
connected to "function 0", effectively limiting it to 32
downstreams/upstream.
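
In the XML that might look like this sketch (all index/bus/slot numbers
are made up, and both downstream ports sit on function 0, matching what
Don describes):

    <controller type='pci' index='2'> <!-- assumes a root-port provides bus 1 -->
      <model type='upstream-switch-port'/>
      <address type='pci' domain='0' bus='1' slot='0' function='0'/>
    </controller>
    <controller type='pci' index='3'>
      <model type='downstream-switch-port'/>
      <address type='pci' domain='0' bus='2' slot='0' function='0'/>
    </controller>
    <controller type='pci' index='4'>
      <model type='downstream-switch-port'/>
      <address type='pci' domain='0' bus='2' slot='1' function='0'/>
    </controller>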


>>
>>> qemu commandline: -device x3130-upstream
>>>
>>>
>>> This is the upper side of a switch that can multiplex multiple devices
>>> onto a single port. It's only useful when one or more downstream switch
>>> ports are connected to it.
>>>
>>> <model type='downstream-switch-port'/> (xio3130-downstream)
>>> --------------------------------------
>>>
>>> Upstream:         connect *only* to upstream-switch-port
>>> Downstream:       1 slot, any PCIe device
>>> qemu commandline: -device xio3130-downstream
>>>
>>> You can connect one or more of these to an upstream-switch-port in order
>>> to effectively plug multiple devices into a single PCIe port.
>>>
>>> <model type='pci-bridge'/> (pci-bridge)
>>> --------------------------
>>>
>>> Upstream:         PCI, connect to 1) pci-root, 2) dmi-to-pci-bridge, 3)
>>> another pci-bridge
>>> Downstream:       any PCI device, 32 slots
>>> qemu commandline: -device pci-bridge,...
>>>
>>> This differs from dmi-to-pci-bridge in that its upstream connection is
>>> PCI rather than PCIe (so it will work on an i440FX system, which has no
>>> root PCIe bus) and that hotplug is supported. In general, if a guest
>>> will have any PCI devices, one of these controllers should be added, and
>>>
>>> ===============================================================
>>>
>>>
>>> Comment: I'm not quite convinced that we really need the separate
>>> "pci-root" device. Since 1) every pci-root will *always* have either a
>>> pcie-root-bus or a pci-bridge connected to it, 2) the pci-root-bus will
>>> only ever be connected to the pci-root, and 3) the pci-bridge that
>>> connects to it will need special handling within the pci-bridge case
>>> anyway, why not:
>>>
>>> 1) eliminate the separate pci-root controller type
>>>
>>> 2) within <controller type='pci'>, a new <model type='pci-root-bus'/>
>>> will be added.
>>>
>>> 3) a pcie-root-bus will automatically be added for q35 machinetypes, and
>>> pci-root-bus for any machinetype that supports a PCI bus (e.g. "pc-*")
>>>
>>> 4) model type='pci-root-bus' will behave like pci-bridge, except that it
>>> will be an implicit device (nothing on qemu commandline) and it won't
>>> need an <address> element (neither will pcie-root-bus).
>> I think they should both have a domain + bus address to make it possible
>> to build multi-domain/multi-host bridge systems.  They do not use any
>> slots though.


Yes. I think I agree with that. But we don't have to implement the
multiple-domain stuff today (since qemu doesn't support it yet), and
when we do, I think we can just add a "domain" attribute to the main
element of pci-root and pcie-root controllers.


>>> 5) to support multiple domains, we can simply add a "domain" attribute
>>> to the toplevel of controller.
>>>
>> Or would this even be necessary if we supported a 'pci-root-addr'
>> address type for the above with the default being domain=0, bus=0?  I
>> suppose it doesn't matter whether it's a separate attribute or new
>> address type though.  Thanks,

I think you're mixing up the purpose of the <address> element vs the
"index" attribute in the main <controller> element. To clarify, take
this example:


    <controller type='pci' index='3'>
      <model type='pci-bridge'/>
      <address type='pci' domain='0' bus='1' slot='9' function='0'/>
    </controller>

This controller is connected to slot 9 of the already-existing bus 1.
It provides a bus 3 for other devices to connect to. If we wanted to
start up a domain 1, we would do something like this:

    <controller type='pci' domain='1' index='0'>
      <model type='pci-root'/>
    </controller>

This would give us a PCI bus 0 in domain 1. You could then connect a
pci-bridge to it like this:


    <controller type='pci' domain='1' index='1'>
      <model type='pci-bridge'/>
      <address type='pci' domain='1' bus='0' slot='1' function='0'/>
    </controller>

The <address> tells us that this new bus connects to slot 1 of PCI bus
0 in domain 1. The <controller domain='1' index='1'> tells us that
there is now a new bus other devices can connect to that is at
domain='1' bus='1'.


> Also AFAIK there's nothing in the spec that requires bus=0
> to be root. The _BBN hack above is used sometimes to give !=0
> bus numbers to roots.

I don't really understand that, but do you think that 1) qemu would ever
want/be able to model that, or that 2) anyone would ever have a
practical reason for wanting to? It's really cool and all to be able to
replicate any possible esoteric hardware configuration in a virtual
machine, but it seems like the only practical use of replicating
something like that would be for someone wanting to test what their OS
does when there's no domain=0 in the hardware...



