[libvirt] Proposal PCI/PCIe device placement on PAPR guests

Fri Jan 13 06:03:24 UTC 2017

On 13/01/17 15:48, David Gibson wrote:
> On Thu, Jan 12, 2017 at 10:09:03AM +0100, Greg Kurz wrote:
>> On Thu, 12 Jan 2017 17:19:40 +1100
>> Alexey Kardashevskiy <aik at ozlabs.ru> wrote:
>>
>>> On 12/01/17 14:52, David Gibson wrote:
>>>> On Fri, Jan 06, 2017 at 12:57:58PM +0100, Greg Kurz wrote:  
>>>>> On Thu, 5 Jan 2017 16:46:18 +1100
>>>>> David Gibson <david at gibson.dropbear.id.au> wrote:
>>>>>  
>>>>>> There was a discussion back in November on the qemu list which spilled
>>>>>> onto the libvirt list about how to add support for PCIe devices to
>>>>>> POWER VMs, specifically 'pseries' machine type PAPR guests.
>>>>>>
>>>>>> Here's a more concrete proposal for how to handle part of this in
>>>>>> future from the libvirt side.  Strictly speaking what I'm suggesting
>>>>>> here isn't intrinsically linked to PCIe, but it will make adding PCIe
>>>>>> support sanely easier, and it has a number of advantages for
>>>>>> both PCIe and plain-PCI devices on PAPR guests.
>>>>>>
>>>>>> Background:
>>>>>>
>>>>>>  * Currently the pseries machine type only supports vanilla PCI
>>>>>>    buses.
>>>>>>     * This is a qemu limitation, not something inherent - PAPR guests
>>>>>>       running under PowerVM (the IBM hypervisor) can use passthrough
>>>>>>       PCIe devices (PowerVM doesn't emulate devices though).
>>>>>>     * In fact the way PCI access is para-virtualized in PAPR makes the
>>>>>>       usual distinctions between PCI and PCIe largely disappear
>>>>>>  * Presentation of PCIe devices to PAPR guests is unusual
>>>>>>     * Unlike x86 and other "bare metal" platforms, root ports are
>>>>>>       not made visible to the guest, i.e. all devices (typically)
>>>>>>       appear as though they were integrated devices on x86
>>>>>>     * In terms of topology all devices will appear in a way similar to
>>>>>>       a vanilla PCI bus, even PCIe devices
>>>>>>        * However PCIe extended config space is accessible
>>>>>>     * This means libvirt's usual placement of PCIe devices is not
>>>>>>       suitable for PAPR guests
>>>>>>  * PAPR has its own hotplug mechanism
>>>>>>     * This is used instead of standard PCIe hotplug
>>>>>>     * This mechanism works for both PCIe and vanilla-PCI devices
>>>>>>     * This can hotplug/unplug devices even without a root port P2P
>>>>>>       bridge between it and the root "bus"
>>>>>>  * Multiple independent host bridges are routine on PAPR
>>>>>>     * Unlike PC (where all host bridges have multiplexed access to
>>>>>>       configuration space) PCI host bridges (PHBs) are truly
>>>>>>       independent for PAPR guests (disjoint MMIO regions in system
>>>>>>       address space)
>>>>>>     * PowerVM typically presents a separate PHB to the guest for each
>>>>>>       host slot passed through
>>>>>>
>>>>>> The Proposal:
>>>>>>
>>>>>> I suggest that libvirt implement a new default algorithm for placing
>>>>>> (i.e. assigning addresses to) both PCI and PCIe devices for (only)
>>>>>> PAPR guests.
>>>>>>
>>>>>> The short summary is that by default it should assign each device to a
>>>>>> separate vPHB, creating vPHBs as necessary.
>>>>>>
>>>>>>   * For passthrough sometimes a group of host devices can't be safely
>>>>>>     isolated from each other - this is known as a (host) Partitionable
>>>>>>     Endpoint (PE).  In this case, if any device in the PE is passed
>>>>>>     through to a guest, the whole PE must be passed through to the
>>>>>>     same vPHB in the guest.  From the guest POV, each vPHB has exactly
>>>>>>     one (guest) PE.
>>>>>>   * To allow for hotplugged devices, libvirt should also add a number
>>>>>>     of additional, empty vPHBs (the PAPR spec allows for hotplug of
>>>>>>     PHBs, but this is not yet implemented in qemu).  When hotplugging
>>>>>>     a new device (or PE) libvirt should locate a vPHB which doesn't
>>>>>>     currently contain anything.
>>>>>>   * libvirt should only (automatically) add PHBs - never root ports or
>>>>>>     other PCI to PCI bridges
>>>>>>
>>>>>> In order to handle migration, the vPHBs will need to be represented in
>>>>>> the domain XML, which will also allow the user to override this
>>>>>> topology if they want.
>>>>>>
>>>>>> Advantages:
>>>>>>
>>>>>> There are still some details I need to figure out w.r.t. handling PCIe
>>>>>> devices (on both the qemu and libvirt sides).  However the fact that  
>>>>>
>>>>> One such detail may be that PCIe devices should have the
>>>>> "ibm,pci-config-space-type" property set to 1 in the DT,
>>>>> for the driver to be able to access the extended config
>>>>> space.  
>>>>
>>>> So, we have a bit of an oddity here.  It looks like we currently set
>>>> 'ibm,pci-config-space-type' to 1 in the PHB, rather than individual
>>>> device nodes.  Which, AFAICT, is simply incorrect in terms of PAPR.  
>>>
>>>
>>> I asked Paul how to read the spec, and this is correct but not sufficient:
>>> having type=1 on a PHB means that extended access requests can pass through
>>> it, but underlying devices and bridges still need to have type=1 if they
>>> support extended space. Having type set to 0 (or absent altogether) on a PHB
>>> would mean that extended config space is not available on anything under
>>> this PHB.
>>>
>>
>> I have the very same understanding of the spec (LoPAPR March 2016):
>>
>> R1–9.1.8–2. All IOAs that implement PCI-X Mode 2 or PCI Express must supply the
>> “ibm,pci-config-space-type” property (see Section B.6.5.1.1.1, “Properties for
>> Children of PCI Host Bridges,” on page 703).
>>
>> Implementation Note: The “ibm,pci-config-space-type” property in Requirement R1–9.1.8–2 is added for
>> platforms that support I/O fabric and IOAs that implement PCI-X Mode 2, and PCI Express. To access the
>> extended configuration space provided by PCI-X Mode 2 and PCI Express, all I/O fabric leading up to an IOA
>> must support a 12-bit register number. In other words, if a platform implementation has a conventional PCI bridge
>> leading up to an IOA that implements PCI-X Mode 2, the platform will not be able to provide access to the
>> extended configuration space of that IOA. The “ibm,config-space-type” property in the IOA's OF node
>> is used by device drivers to determine if an IOA’s extended configuration space can be accessed.
>>
>> and
>>
>> B.6.5.1.1.1 Properties for Children of PCI Host Bridges
>>
>> “ibm,pci-config-space-type”
>> property name: Indicates if the platform supports access to an extended configuration address space from the PHB
>> up to and including this node.
>> 0 = Platform supports only an eight bit register number for configuration address space accesses.
>> 1 = Platform supports a twelve bit register number for configuration address space accesses.
>> This property may be provided in all PHB nodes and their children.
>> Note: The absence of this property implies the platform supports only an eight bit register number for
>> configuration address space accesses.
>>
>>
>> And incidentally, this is what the linux kernel currently expects. See these lines
>> from arch/powerpc/kernel/pci_dn.c:
>>
>> struct pci_dn *pci_add_device_node_info(struct pci_controller *hose,
>>                                         struct device_node *dn)
>> {
>>         const __be32 *type = of_get_property(dn, "ibm,pci-config-space-type", NULL);
>> .
>> .
>> .
>>         /* Extended config space */
>>         pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1);
> 
> Ok, thanks for the information.
> 
>> I had to rework Alexey's "spapr_pci: Create PCI-express root bus by default"
>> patch to be able to see the extended config space of a vfio-pci device:
> 
> Ah!  Is there an easy command line way to verify that extended config
> space is accessible?

I do "lspci -vvs 0003:01:00.3" and look for "Capabilities: [xxx v1]" where
xxx >= 0x100.
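
If a scripted check is preferred over reading lspci output, the same test
can be sketched in C. This is only an illustration, not code from qemu,
libvirt or the kernel: the names cfg_read32() and count_ext_caps() are made
up for this example, and it assumes the caller already holds a raw image of
the device's config space in memory. It walks the PCIe extended capability
list, which starts at offset 0x100, and counts the entries it finds.

```c
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_SIZE      0x100   /* conventional PCI config space */
#define CFG_SPACE_EXP_SIZE  0x1000  /* PCIe extended config space */

/* Read a little-endian 32-bit value from a config space image. */
static uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
    return (uint32_t)cfg[off] |
           ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) |
           ((uint32_t)cfg[off + 3] << 24);
}

/*
 * Count the extended capabilities visible in a config space image.
 * Hypothetical helper: returns 0 when the image is too short to hold
 * extended space, or when the first header reads as all zeroes or all
 * ones (nothing visible at 0x100).
 */
int count_ext_caps(const uint8_t *cfg, size_t len)
{
    size_t off = CFG_SPACE_SIZE;
    int count = 0;
    /* Each capability is at least 8 bytes, so this bounds the walk. */
    int ttl = (CFG_SPACE_EXP_SIZE - CFG_SPACE_SIZE) / 8;

    while (off >= CFG_SPACE_SIZE && off + 4 <= len && ttl-- > 0) {
        uint32_t hdr = cfg_read32(cfg, off);

        if (hdr == 0 || hdr == 0xffffffffu)
            break;

        count++;
        /* Bits 31:20 of the header hold the next capability offset. */
        off = (hdr >> 20) & 0xffc;
    }
    return count;
}
```

On a Linux guest the image would typically come from the device's sysfs
"config" file read as root; a non-zero count corresponds to lspci showing
at least one capability at 0x100 or above.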

-- 
Alexey
