[libvirt] [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default

Tue Dec 6 17:30:47 UTC 2016

On Fri, 2016-12-02 at 15:18 +1100, David Gibson wrote:
> > So, would the PCIe Root Bus in a pseries guest behave
> > differently than the one in a q35 or mach-virt guest?
> 
> Yes.  I had a long discussion with BenH and got a somewhat better idea
> about this.

Sorry, but I'm afraid you're going to have to break this
down even further for me :(

> If only a single host PE (== iommu group) is passed through and there
> are no emulated devices, the difference isn't too bad: basically on
> pseries you'll see the subtree that would be below the root complex on
> q35.
> 
> But if you pass through multiple groups, things get weird.

Is the difference between q35 and pseries guests with
respect to PCIe only relevant when it comes to assigned
devices, or in general? I'm asking this because you seem to
focus entirely on assigned devices.

> On q35,
> you'd generally expect physically separate (different slot) devices to
> appear under separate root complexes.

This part I don't get at all, so please bear with me.

The way I read it, you're claiming that e.g. a SCSI controller
and a network adapter, being physically separate and assigned
to separate PCI slots, should have a dedicated PCIe Root
Complex each on a q35 guest.

That doesn't match my experience, where you would simply
assign them to separate slots of the default PCIe Root Bus
(pcie.0), e.g. 00:01.0 and 00:02.0.
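
As a rough sketch, this is the kind of libvirt XML I have in
mind for a q35 guest. The host-side source addresses (03:00.0
and 04:00.0) are made up for the sake of the example; only the
guest-side addresses on bus 0x00, that is pcie.0, matter here.

```xml
<!-- First assigned device, shows up as 00:01.0 in the guest -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- hypothetical host address -->
    <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
</hostdev>

<!-- Second assigned device, shows up as 00:02.0 in the guest -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- hypothetical host address -->
    <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</hostdev>
```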

Maybe you're referring to the fact that you might want to
create multiple PCIe Root Complexes in order to assign the
host devices to separate guest NUMA nodes? How is creating
multiple PCIe Root Complexes on q35 using pxb-pcie different
from creating multiple PHBs using spapr-pci-host-bridge on
pseries?
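
For reference, this is roughly how the q35 side of that
comparison can be written in libvirt XML: a pcie-expander-bus
controller (pxb-pcie in QEMU) tied to a guest NUMA node. The
busNr, slot and node values are picked arbitrarily, and the
guest is assumed to have a NUMA node 1 defined elsewhere in
its XML.

```xml
<controller type='pci' index='0' model='pcie-root'/>
<controller type='pci' index='1' model='pcie-expander-bus'>
  <model name='pxb-pcie'/>
  <!-- arbitrary bus number; node refers to a guest NUMA node -->
  <target busNr='254'>
    <node>1</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</controller>
```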

> Whereas on pseries they'll
> appear as siblings on a virtual bus (which makes no physical sense for
> point-to-point PCI-E).

What is the virtual bus in question? Why would it matter
that they're siblings?

I'm possibly missing the point entirely, but so far it
looks to me like there are different configurations you
might want to use depending on your goal, and both q35
and pseries give you comparable tools to achieve such
configurations.

> I suppose we could try treating all devices on pseries as though they
> were chipset builtin devices on q35, which will appear on the root
> PCI-E bus without root complex.  But I suspect that's likely to cause
> trouble with hotplug, and it will certainly need different address
> allocation from libvirt.

PCIe Integrated Endpoint Devices are not hotpluggable on
q35, which is why libvirt will follow QEMU's PCIe topology
recommendations and plug each device into a PCIe Root Port
instead; I assume the same could be done for pseries guests
as soon as QEMU grows support for generic PCIe Root Ports,
something Marcel has already posted patches for.

Again, sorry for clearly misunderstanding your explanation,
but I'm still not seeing the issue here. I'm sure it's very
clear in your mind, but I'm afraid you're going to have to
walk me through it :(

> > Regardless of how we decide to move forward with the
> > PCIe-enabled pseries machine type, libvirt will have to
> > know about this so it can behave appropriately.
> 
> So there are kind of two extremes of how to address this.  There are a
> variety of options in between, but I suspect they're going to be even
> more muddled and hideous than the extremes.
> 
> 1) Give up.  You said there's already a flag that says a PCI-E bus is
> able to accept vanilla-PCI devices.  We add a hack flag that says a
> vanilla-PCI bus is able to accept PCI-E devices.  We keep address
> allocation as it is now - the pseries topology really does resemble
> vanilla-PCI much better than it does PCI-E. But, we allow PCI-E
> devices, and PAPR has mechanisms for accessing the extended config
> space.  PCI-E standard hotplug and error reporting will never work,
> but PAPR provides its own mechanisms for those, so that should be ok.

We can definitely special-case pseries guests and take
the "anything goes" approach to PCI vs PCIe, but it would
certainly be nicer if we could avoid presenting our users
the head-scratching situation of PCIe devices being plugged
into legacy PCI slots and still showing up as PCIe in the
guest.

What about virtio devices, which present themselves either
as legacy PCI or PCIe depending on the kind of slot they
are plugged into? Would they show up as PCIe or legacy PCI
on a PCIe-enabled pseries guest?

> 2) Start exposing the PCI-E hierarchy for pseries guests much more
> like q35, root complexes and all.  It's not clear that PAPR actually
> *forbids* exposing the root complex, it just doesn't require it and
> that's not what PowerVM does.  But.. there are big questions about
> whether existing guests will cope with this or not.  When you start
> adding in multiple passed through devices and particularly virtual
> functions as well, things could get very ugly - we might need to
> construct multiple emulated virtual root complexes or other messes.
> 
> In the short to medium term, I'm thinking option (1) seems pretty
> compelling.

Is the Root Complex not currently exposed? The Root Bus
certainly is, otherwise PCI devices wouldn't work at all, I
assume. And I can clearly see a pci.0 bus in the output
of 'info qtree' for a pseries guest, and a pci.1 too if
I add a spapr-pci-host-bridge.

Maybe I just don't quite get the relationship between Root
Complexes and Root Buses, but I guess my question is: what
is preventing us from simply doing whatever a
spapr-pci-host-bridge is doing in order to expose a legacy
PCI Root Bus (pci.*) to the guest, and create a new
spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
instead?

> So, I'm not sure if the idea of a new machine type has legs or not,
> but let's think it through a bit further.  Suppose we have a new
> machine type, let's call it 'papr'.  I'm thinking it would be (at
> least with -nodefaults) basically a super-minimal version of pseries:
> so each PHB would have to be explicitly created, the VIO bridge would
> have to be explicitly created, likewise the NVRAM.  Not sure about the
> "devices" which really represent firmware features - the RTC, RNG,
> hypervisor event source and so forth.
> 
> Might have some advantages.  Then again, it doesn't really solve the
> specific problem here.  It means libvirt (or the user) has to
> explicitly choose a PCI or PCI-E PHB to put things on,

libvirt would probably add a

  <controller type='pci' model='pcie-root'/>

to the guest XML by default, resulting in a
spapr-pcie-host-bridge providing pcie.0 and the same
controller / address allocation logic as q35; the user
would be able to use

  <controller type='pci' model='pci-root'/>

instead to stick with legacy PCI. This would only matter
when using '-nodefaults' anyway: when that flag is not
present, a PCIe (or legacy PCI) PHB could be created by QEMU
to make it more convenient for people who are not using
libvirt.

Maybe we should have a different model, specific to
pseries guests, instead, so that all PHBs would look the
same in the guest XML, something like

  <controller type='pci' model='phb-pcie'/>

It would require shuffling libvirt's PCI address allocation
code around quite a bit, but it should be doable. And if it
makes life easier for our users, then it's worth it.

> but libvirt's
> PCI-E address allocation will still be wrong in all probability.
> 
> Guh.

> As an aside, here's a RANT.
[...]

Laine already addressed your points extensively, but I'd
like to add a few thoughts of my own.

* PCI addresses for libvirt guests need to be stable not
  only when performing migration, but also to guarantee
  that no change in guest ABI will happen as a consequence
  of e.g. a simple power cycle.

* Even if libvirt left all PCI address assignment to QEMU,
  we would need a way for users to override QEMU's choices,
  because one size never fits all and users have all kinds
  of crazy, yet valid, requirements. So the first time we
  run QEMU, we would have to take the backend-specific
  format you suggest, parse it to extract the PCI addresses
  that have been assigned, and reflect them in the guest
  XML so that the user can change a bunch of them. Then I
  guess we could re-encode it in the backend-specific format
  and pass it to QEMU the next time we run it but, at that
  point, what's the difference with simply putting the PCI
  addresses on the command line directly?

* It's not just about the addresses, by the way, but also
  about the controllers - what model is used, how they are
  plugged together and so on. More stuff that would have to
  round-trip because users need to be able to take matters
  into their own hands.

* Design mistakes in any software, combined with strict
  backwards compatibility requirements, make it difficult
  to make changes in both related components and the
  software itself, even when the changes would be very
  beneficial. It can be very frustrating at times, but
  it's the reality of things and unfortunately there's only
  so much we can do about it.

* Eduardo's work, which you mentioned, is going to be very
  beneficial in the long run; in the short run, Marcel's
  PCIe device placement guidelines, a document that has seen
  contributions from QEMU, OVMF and libvirt developers, have
  been invaluable in improving libvirt's PCI address allocation
  logic. So we're already doing better, and more improvements
  are on the way :)
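
To make the points about addresses above a bit more concrete,
this is the kind of element users see and edit in the guest
XML; the device type and slot number are just an example.
libvirt turns the <address> element into the matching bus=
and addr= options for the device on the QEMU command line,
which is how the guest ends up seeing the same address every
time it is started.

```xml
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <!-- guest address 00:03.0; changing slot= moves the device -->
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
```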

-- 
Andrea Bolognani / Red Hat / Virtualization