[libvirt] [PATCH 00/13] PCIe fixes + new PCI controllers w/RFC

Laine Stump laine at laine.org
Fri Jun 26 15:54:25 UTC 2015


On 06/25/2015 02:44 AM, Martin Kletzander wrote:
> On Wed, Jun 24, 2015 at 12:04:33PM -0400, Laine Stump wrote:
>> I think the only votes were for option 1 and 4 (interesting how the only
>> ones that were chosen were those that I *didn't* pick personally :-).
>> See comments below. In the meantime the other issue Alex pointed out may
>> cause this to take a slightly different direction.
>>
>> On 06/22/2015 02:44 PM, Laine Stump wrote:
>>> ===
>>> Idea 1: multiplex the meaning of the "model" attribute, so we
>>> currently have:
>>>
>>>   <controller type='pci' model='dmi-to-pci-bridge'/>
>>>
>>> which means "add an i82801b11-bridge device" and when we add a generic
>>> version of this type of controller, we would do it with something like:
>>>
>>>   <controller type='pci' model='generic-dmi-to-pci-bridge'/>
>>>
>>> and for another vendor's mythical controller:
>>>
>>>   <controller type='pci' model='xyz-dmi-to-pci-bridge'/>
>>>
>>> Cons: This will make for ugliness in switch statements where a new
>>> case will have to be added whenever a different controller with
>>> similar behavior/usage is supported. And it's generally not a good
>>> idea to
>>> have a single attribute be used for two different functions.
>>
>> jtomko advocated this one because it would make it easier for an older
>> libvirt to notice an unsupported class+model of controller during a
>> migration attempt from a newer libvirt. An example would be if a newer
>> libvirt had a guest with
>>
>>    <controller type='pci' model='xyz'/>
>>
>> (where xyz is some qemu device that implements a dmi-to-pci-bridge)
>>
>> That would fail on an attempt to migrate to older libvirt, but
>>
>>   <controller type='pci' model='dmi-to-pci-bridge'/>
>>
>> would succeed. On the other hand, if we did this:
>>
>>   <controller type='pci' model='dmi-to-pci-bridge submodel='xyz'/>
>>
>> that would succeed (and create the wrong device) because submodel would
>> be ignored.
>>
>> Since there is currently no alternate implementation of a
>> dmi-to-pci-bridge available in qemu, and it will likely be awhile until
>> that happens, I think if we start filling out the XML now as:
>>
>>   <controller type='pci' model='dmi-to-pci-bridge'
>>                          submodel='i82801b11-bridge'/>
>>
>> by the time we get to the point that we do have an alternate 'xyz'
>> controller, any version of libvirt that doesn't understand "submodel"
>> will be so far in the past that we wouldn't want to be able to migrate
>> back to it anyway.
>>
>>
>>> ===
>>> Idea 4: Unlike other uses of "model" in libvirt, for pci controllers,
>>> continue to use "model" to denote the subtype/class/whatever of
>>> controller, and create a new attribute to denote the different
>>> specific implementations of each one. So for example:
>>>
>>> [4]  <controller type='pci' model='dmi-to-pci-bridge'/>
>>>
>>> would become:
>>>
>>> [5]  <controller type='pci' model='dmi-to-pci-bridge'
>>>                             implementation='i82801b11-bridge'/>
>>>
>>> (or some other name in place of "implementation" - ideas? I'm horrible
>>> at thinkin up names)
>>>
>>> Pros: wouldn't create compatibility problems when downgrading or
>>> migrating cross version.
>>>
>>> Cons: Is inconsistent with every other use of "model" attribute in
>>> libvirt, and each new addition of a PCI controller further propagates
>>> this misuse.
>>>
>>
>> mkletzan preferred this one, and danpb was amenable to it in IRC. I
>> think I now agree, but Alex's comments about needing to keep track of
>> what we put in the qemu commandline for port and chassis have me
>> thinking this isn't enough.
>>
>> The problem is that the "ioh3420" device needs its "port" and "chassis"
>> options set; Alex recommends setting port to:
>>
>>   (slot << 3)+function
>>
>> Likewise, he recommends setting "port" for the xio3130-downstream device
>> to the same value as slot. So, for the following:
>>
>>   <controller type='pci model='pci-root-port' index='3'>
>>     <address type='pci' bus='0' slot='4' function='1'>
>>   </controller>
>>
>> we would end up with the following commandline:
>>
>>   -device ioh3420,chassis=3,port=0x21,id='pcie.3',bus='pcie.0',
>>           addr=0x4.0x1
>>
>> (chassis is the same as "index" in the xml (again per Alex's
>> recommendation), and port is (4 << 3) + 1 = 0x21)
>>
>> *However* he also says that we may change our mind on these in the
>> future, so chassis and port may end up being something different. Since
>> those values are visible to the guest, we can't allow them to change on
>> an existing machine, as it breaks the guest ABI. So we need to store
>> them in the XML. They aren't part of the PCI address though, they are
>> options. So I need to figure out the best way to represent that in the
>> XML, and fill it in when it is auto-generated when the controller is
>> defined. A few possible ways to solve both problems at once:
>>
>> [1] <controller type='pci model='pci-root-port' index='3'>
>>      <address type='pci' bus='0' slot='4' function='1'>
>>      <target type='ioh3420' chassis='3' port='0x21'/>
>>    </controller>
>>
>> Precedent: <target> is used to store the port number for serial/console
>> devices, specify guest-side bus and device name for disks, specify
>> guest-side mount location for filesystems, specify the *host*-side name
>> of tap devices for network interfaces, memory size for <memory>. So it
>> seems kind of proper. (slight variation - <target model='ioh3420' .../>
>> :-/ )
>>
>> [2] <controller type='pci model='pci-root-port' index='3'>
>>      <address type='pci' bus='0' slot='4' function='1'>
>>      <model type='ioh3420' chassis='3' port='0x21'/>
>>    </controller>
>>
>> (How's *that* for confusing?!? Both a model attribute *and* a model
>> subelement.)
>>
>> Precedent: This is exactly patterned after the use of <model> in <video>
>> devices though, where the specific card being emulated along with any
>> memory/etc options are put in <model>, so at least there is precedence)
>>
>> [3] <controller type='pci model='pci-root-port' index='3'>
>>      <address type='pci' bus='0' slot='4' function='1'>
>>      <device type='ioh3420' chassis='3' port='0x21'/>
>>    </controller>
>>
>> (This one is completely new, just for a fresh start in case people think
>> neither of the others are exactly right)
>>
>> At this point you can see that it all comes down to what name we want to
>> give the subelement; within that, we give the exact name of the qemu
>> device, and the exact name/value of any qemu options that we set to
>> non-default values.
>>
>> Somebody *please* have an opinion on the name for this, because none of
>> these strikes me as better or worse (well, I think I dislike <driver>
>>
>
> To be honest, I kinds dislike all of them.  Not that they would be
> chosen poorly, no, it's simply because the good sensible choice is
> unavailable due to another poor decision in the past (this may be
> another point for Michal's talk on KVM Forum).  Thinking about it I
> must say I don't like how target (which is supposed to match a place
> where the device appears for the guest) is used for the model
> specification, on the other hand (ab)using 'model' element for the
> specification of an "address" in guest (that's what I understand
> chassis and port are)

Actually they aren't an address. The address where the controller is
connected is completely specified in its <address> element, and the
address where other devices can connect to this controller is specified
by 1) the index of the controller and 2) implied (from the class of
controller) information about how many "slots" are available on this
controller and the starting slot#.

As it was described to me (by Alex in IRC and email) chassis,
chassis_nr, and port are just numbers that are stored in a register that
the guest can examine, and on real hardware they are used to learn, e.g.
the physical location of a connector. From libvirt's point of view, you
can end the previous sentence after the word "register". This is part of
the reason the proper solution isn't obvious/clear.

> doesn't feel any better.  What if we go with two
> of those elements?  Would that be too much pain?  E.g.:
>
>  <controller type='pci model='pci-root-port' index='3'>
>    <address type='pci' bus='0' slot='4' function='1'>
>    <model type='ioh3420'/>
>    <target chassis='3' port='0x21'/>
>  </controller>
>
> I understand this might look like an overkill, but I think it's better
> safe then sorry, I guess I just see us not so far in the future
> regretting any decision made now.

Well, another item that we will want to specify in the future is the
numa node of a pxb bridge (which I now understand as another kind of
"pci-bridge", *not* another kind of "pci-root" as I previously
believed). I just looked for where numa node info is set for other
devices, and see that for the <memory> device, it is set with the numa
*subelement* of the target subelement of <memory>. So:

  <memory ....>
    <target>
      <numa>1</numa>
    </target>
  <memory>

On the other hand, guest-side nume node for CPUs is set like this:

  <cpu>
    <numa>
      <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
      <cell id='1' cpus='4-7' memory='512000' unit='KiB'
memAccess='shared'/>
    </numa>
  </cpu>

so no use of <target> at all in that case.

I've tried to find a "smoking gun" item that would force me to use your
suggestion, but haven't yet.


So it comes down to this: Do we want to be more consistent with:

 <video>
   <model type='vga' vram='16384' heads='1'>
     <acceleration accel3d='yes' accel2d='yes'/>
   </model>
 </video>

(i.e. everything in <model>) or:

 <memory ...>
   <target>
     <size>blah</size>
     <node>1</node>
   </target>
   ...
 </memory>

and

 <disk ...>
   <target dev='hdb' bus='ide'/>
 </disk

???
> Another idea I briefly thought about, which would also deal with the
> migration to older libvirt versions was inventing a new address
> type ... and then I realized how stupid^Wugly was that.
>
> Anyway, I must say that I don't fully get the idea of how some of
> these controllers are supposed to be specified and what particular HW
> they should describe, so I might not had the best idea in town.

It's like a hybrid set of legos and duplos (and also some "leglos" or
"dupos" or something) mixed together - you can't just plug "anything"
into "anything". Each controller has limitations on what it can be
plugged into, what (and how many) can be plugged into it, and whether or
not each of the connections (both up and down) supports hotplug/unplug.

It's taken me quite awhile to get a decent model in my mind of how they
all fit together, and I still learn something almost daily that
invalidates some belief I've had.

>
> <rant> Mainly I don't understand why there's no pci switch.  Or is
> <controller type='pci'/> actually a pci switch (I usderstand it like
> that)?  Then why is there a downstream and upstream device when
> {down,up}stream should just indicate whether the port aims towards the
> root complex or the endpoints, respectively. </rant>

Yeah, that one bothers me too. The picture I have in my mind is that the
x3130-upstream is a switch with one upstream (pcie) port and 32
downstream (switch) ports, and that the xio3130-downstream device is a
"converter dongle" that has one upstream (switch) port and one
downstream (pcie) port. So the upstream port gives you a switch and a
bunch of downstream ports, but you can only plug one kind of thing into
them. I wanted to call the upstream one "pcie-switch" and the downstream
one "pcie-switch-port", but Alex (who is my go-to authority on all
things PCIe) said that seemed wrong to him, the "switch" is the
collection of one upstream and one or more downstream ports, and that he
really wanted the "upstream" and "downstream" in the names. That's why I
now have pcie-switch-upstream-port and pcie-switch-downstream-port. So
the way I think about it now is that the "switch" is this magical
nonexistent thing that sits between an upstream and a bunch of
downstreams. I still like the other way better, but am going with what
seems right to Alex because he's been dealing with these devices for
much longer than me so he almost certainly knows some nuances that I'm
unaware of.


>
>>
>>
>> (P.S. There are other "automagic" qemu options being specified by
>> libvirt that maybe could use the same treatment as these. Two that come
>> to mind are chassis_nr for pci-bridge controllers (set to the
>> controller's index), and "vectors" for virtio-net multiqueue support
>> (set to (queues*2)+2).
>
> I guess 'chassis_nr' is something exposed to guest, similarly to
> 'chassis' in the pci-root-port case, so that might be worth saving in
> the XML.  Maybe we can use 'chassis' in the target/model specification
> for storing the chassis_nr as a level of abstraction on libvirt level.

That thought crossed my mind, but then I thought of the possibility that
there may some day be a device that uses both chassis and chassis_nr at
the same time. I even have a vague memory of something like this
happening in the past (although it may have been somewhere totally
unrelated to libvirt). For that reason,

>
> But the number of vectors (to my knowledge) is not something exposed
> to guest and is only advised to being set according to the formula you
> wrote, above.

Ah, good point. We're safe on that one then.





More information about the libvir-list mailing list