[libvirt] Proposal PCI/PCIe device placement on PAPR guests

Marcel Apfelbaum marcel at redhat.com
Wed Jan 18 12:27:44 UTC 2017


On 01/13/2017 01:14 AM, David Gibson wrote:
> On Thu, Jan 12, 2017 at 11:03:05AM -0500, Laine Stump wrote:
>> On 01/05/2017 12:46 AM, David Gibson wrote:
>>> There was a discussion back in November on the qemu list which spilled
>>> onto the libvirt list about how to add support for PCIe devices to
>>> POWER VMs, specifically 'pseries' machine type PAPR guests.
>>>
>>> Here's a more concrete proposal for how to handle part of this in
>>> future from the libvirt side.  Strictly speaking what I'm suggesting
>>> here isn't intrinsically linked to PCIe: it will make adding PCIe
>>> support sanely easier, as well as having a number of advantages for
>>> both PCIe and plain-PCI devices on PAPR guests.
>>>
>>> Background:
>>>
>>>   * Currently the pseries machine type only supports vanilla PCI
>>>     buses.
>>>      * This is a qemu limitation, not something inherent - PAPR guests
>>>        running under PowerVM (the IBM hypervisor) can use passthrough
>>>        PCIe devices (PowerVM doesn't emulate devices though).
>>>      * In fact the way PCI access is para-virtualized in PAPR makes the
>>>        usual distinctions between PCI and PCIe largely disappear
>>>   * Presentation of PCIe devices to PAPR guests is unusual
>>>      * Unlike x86 and other "bare metal" platforms, root ports are
>>>        not made visible to the guest, i.e. all devices (typically)
>>>        appear the way integrated devices do on x86
>>>      * In terms of topology all devices will appear in a way similar to
>>>        a vanilla PCI bus, even PCIe devices
>>>         * However PCIe extended config space is accessible
>>>      * This means libvirt's usual placement of PCIe devices is not
>>>        suitable for PAPR guests
>>>   * PAPR has its own hotplug mechanism
>>>      * This is used instead of standard PCIe hotplug
>>>      * This mechanism works for both PCIe and vanilla-PCI devices
>>>      * This can hotplug/unplug devices even without a root port or
>>>        P2P bridge between the device and the root bus
>>>   * Multiple independent host bridges are routine on PAPR
>>>      * Unlike PC (where all host bridges have multiplexed access to
>>>        configuration space) PCI host bridges (PHBs) are truly
>>>        independent for PAPR guests (disjoint MMIO regions in system
>>>        address space)
>>>      * PowerVM typically presents a separate PHB to the guest for each
>>>        host slot passed through
>>>
>>> The Proposal:
>>>
>>> I suggest that libvirt implement a new default algorithm for placing
>>> (i.e. assigning addresses to) both PCI and PCIe devices for (only)
>>> PAPR guests.
>>>
>>> The short summary is that by default it should assign each device to a
>>> separate vPHB, creating vPHBs as necessary.
>>>
>>>    * For passthrough sometimes a group of host devices can't be safely
>>>      isolated from each other - this is known as a (host) Partitionable
>>>      Endpoint (PE).  In this case, if any device in the PE is passed
>>>      through to a guest, the whole PE must be passed through to the
>>>      same vPHB in the guest.  From the guest POV, each vPHB has exactly
>>>      one (guest) PE.
>>>    * To allow for hotplugged devices, libvirt should also add a number
>>>      of additional, empty vPHBs (the PAPR spec allows for hotplug of
>>>      PHBs, but this is not yet implemented in qemu).  When hotplugging
>>>      a new device (or PE) libvirt should locate a vPHB which doesn't
>>>      currently contain anything.
>>>    * libvirt should only (automatically) add PHBs - never root ports or
>>>      other PCI to PCI bridges
>>
>>
>> It's a bit unconventional to leave all but one slot of a controller unused,
>
> Unconventional for x86, maybe.  It's been SOP on IBM Power for a
> decade or more.  Both for PAPR guests and in some cases on the
> physical hardware (AIUI many, though not all, Power systems used a
> separate host bridge for each physical slot to ensure better isolation
> between devices).
>
>> but your thinking makes sense. I don't think this will be as
>> large/disruptive of a change as you might be expecting - we already have
>> different addressing rules for automatically vs. manually addressed
>> devices, as well as a framework in place to behave differently for
>> different PCI controllers (e.g. some support hotplug and others don't), and
>> to modify behavior based on machinetype / root bus model, so it should be
>> straightforward to make things behave as you outline above.
>
> Actually, I had that impression, so I was hoping it wouldn't be too
> bad to implement.  I'd really like to get this underway ASAP, so we
> can build the PCIe support (both qemu and Power) around that.
>
>> (The first item in your list sounds exactly like VFIO iommu groups. Is that
>> how it's exposed on PPC?
>
> Yes, for Power hosts and guests there's a 1-1 correspondence between
> PEs and IOMMU groups.  Technically speaking, I believe the PE provides
> more isolation guarantees than the IOMMU group, but they're generally
> close enough in practice.
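
For anyone following along, on a Linux host the group/PE membership of a
candidate passthrough device can be read straight out of sysfs.  A
minimal sketch, assuming the usual sysfs layout:

    import os

    def group_peers(pci_addr):
        """All host devices sharing an IOMMU group (== PE on Power) with
        the given device, e.g. "0000:01:00.0".  They must all be assigned
        to the same guest vPHB, or not assigned at all."""
        group = os.path.join("/sys/bus/pci/devices", pci_addr, "iommu_group")
        return sorted(os.listdir(os.path.join(group, "devices")))
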
>
>> If so, libvirt already takes care of guaranteeing
>> that any devices in the same group aren't used by other guests or the host
>> during the time a guest is using a device.
>
> Yes, I'm aware of that, that's not an aspect I was concerned about.
>
> Although that said, last I heard there was a bug in libvirt which
> on hot *un*plug could assign devices back to the host without waiting
> for other assigned devices in the group, which can crash the host.
>
>> It doesn't automatically assign
>> the other devices to the guest though, since this could have unexpected
>> effects on host operation (the example that kept coming up when this was
>> originally discussed wrt vfio device assignment was the case where a disk
>> device in use on the host was attached to a controller in the same iommu
>> group as a USB controller that was going to be assigned to a guest -
>> silently assigning the disk controller to the guest would cause the host's
>> disk to suddenly become unusable).)
>
> Um.. what!?  If something in the group is assigned to the guest, other
> devices in the group MUST NOT be used by the host, regardless of
> whether they are actually assigned to the guest or not.  In the
> situation you describe the guest would control the IOMMU mappings for
> the host's disk device making it no more usable, and a whole lot more
> dangerous.
>
> In fact I'm pretty sure VFIO won't let you do that: it won't let you
> add the group to a VFIO container until all devices in the group are
> bound to the VFIO stub driver instead of whatever host driver they
> were using before.
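
Roughly that check can also be done up front from userspace before
handing the group to a guest -- a simplified sketch, assuming the
standard sysfs paths and that every device in the group should be bound
to vfio-pci (real groups may also contain bridges that are handled
differently):

    import os

    def group_ready_for_vfio(group_number):
        """True only when every device in the IOMMU group is bound to
        vfio-pci; VFIO refuses to add the group to a container otherwise."""
        devs = "/sys/kernel/iommu_groups/%s/devices" % group_number
        for dev in os.listdir(devs):
            driver = os.path.join("/sys/bus/pci/devices", dev, "driver")
            if not os.path.islink(driver):
                return False     # device not bound to any driver yet
            if os.path.basename(os.readlink(driver)) != "vfio-pci":
                return False     # still bound to a host driver
        return True
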
>
>>> In order to handle migration, the vPHBs will need to be represented in
>>> the domain XML, which will also allow the user to override this
>>> topology if they want.
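
Just to visualise it, one possible shape for that XML, generated from
Python -- the model name and the <target index='...'/> mapping to
spapr-pci-host-bridge's index property are assumptions about how this
proposal could be spelled, not a description of what libvirt does today:

    def phb_controllers_xml(count):
        """XML fragments for `count` vPHBs; index 0 is the default PHB
        the pseries machine provides anyway."""
        tmpl = ("  <controller type='pci' index='{i}' model='pci-root'>\n"
                "    <target index='{i}'/>\n"
                "  </controller>")
        return "\n".join(tmpl.format(i=i) for i in range(count))

    # e.g. print(phb_controllers_xml(4)) and splice the result into the
    # <devices> element of the domain XML before defining the guest.
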
>>>
>>> Advantages:
>>>
>>> There are still some details I need to figure out w.r.t. handling PCIe
>>> devices (on both the qemu and libvirt sides).  However the fact that
>>> PAPR guests don't typically see PCIe root ports means that the normal
>>> libvirt PCIe allocation scheme won't work.
>>
>>
>> Well, the "normal libvirt PCIe allocation scheme" assumes "normal PCIe" :-).
>
> My point exactly.
>
>>
>>
>>>    This scheme has several
>>> advantages with or without support for PCIe devices:
>>>
>>>   * Better performance for 32-bit devices
>>>
>>> With multiple devices on a single vPHB they all must share a (fairly
>>> small) 32-bit DMA/IOMMU window.  With separate PHBs they each have a
>>> separate window.  PAPR guests have an always-on guest visible IOMMU.
>>>
>>>   * Better EEH handling for passthrough devices
>>>
>>> EEH is an IBM hardware-assisted mechanism for isolating and safely
>>> resetting devices experiencing hardware faults so they don't bring
>>> down other devices or the system at large.  It's roughly similar to
>>> PCIe AER in concept, but has a different IBM specific interface, and
>>> works on both PCI and PCIe devices.
>>>
>>> Currently the kernel interfaces for handling EEH events on passthrough
>>> devices will only work if there is a single (host) iommu group in the
>>> vfio container.  While lifting that restriction would be nice, it's
>>> quite difficult to do so (it requires keeping state synchronized
>>> between multiple host groups).  That also means that an EEH error on
>>> one device could stop another device where that isn't required by the
>>> actual hardware.
>>>
>>> The unit of EEH isolation is a PE (Partitionable Endpoint) and
>>> currently there is only one guest PE per vPHB.  Changing this might
>>> also be possible, but is again quite complex and may result in
>>> confusing and/or broken distinctions between groups for EEH isolation
>>> and IOMMU isolation purposes.
>>>
>>> Placing separate host groups in separate vPHBs sidesteps these
>>> problems.
>>>
>>>   * Guest NUMA node assignment of devices
>>>
>>> PAPR does not (and can't reasonably) use the pxb device.  Instead to
>>> allocate devices to different guest NUMA nodes they should be placed
>>> on different vPHBs.  Placing them on different PHBs by default allows
>>> a NUMA node to be assigned to those PHBs in a straightforward manner.
>>
>> So far libvirt doesn't try to assign PCI addresses to devices according to
>> NUMA node, but assumes that the management application will manually address
>> devices that need to be put on a particular pxb (it's only since the recent
>> advent of the pxb that guests have become aware of multiple NUMA nodes).
>> Possibly in the future libvirt will attempt to automatically place devices
>> on a pxb that matches its NUMA node (if it exists). We don't want to force
>> use of pxb for all guests on a host that has multiple NUMA nodes though.
>> This might make more sense on PPC though, since all devices are on a PHB and
>> each PHB can have a NUMA node set.
>
> I hope the connection of pxb to NUMA allocation isn't too tight in
> libvirt.  pxb is essentially an x86 specific hack.  For PAPR guests
> the correct way to assign a NUMA node to PCI devices is to put them on
> separate vPHBs and set the node of the vPHB.  The point above is
> noting that once this proposal is implemented, all we need to do to
> add NUMA awareness is allow a NUMA node property on the vPHBs.
>

The x86/(ACPI-based ARM?) pxb works exactly like that.
It has a numa_node property that associates the pxb with a guest NUMA node.
All devices behind that pxb are connected to the pxb's NUMA node.
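
For reference, on the qemu command line that typically looks roughly
like the fragment below (a sketch; the option spellings are from memory
of the pxb documentation, so double-check before relying on them):

    # guest NUMA nodes themselves still have to be declared with -numa
    qemu_args = [
        "-device", "pxb,id=pxb1,bus=pci.0,bus_nr=4,numa_node=1",
        "-device", "e1000,bus=pxb1,addr=0x4",   # ends up on guest node 1
    ]
    print(" ".join(qemu_args))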

Thanks,
Marcel
