[libvirt] [PATCH v4 2/2] PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough
Peter Krempa
pkrempa at redhat.com
Tue Apr 2 07:37:56 UTC 2019
On Tue, Mar 12, 2019 at 18:55:50 -0300, Daniel Henrique Barboza wrote:
> The NVIDIA V100 GPU has an onboard RAM that is mapped into the
> host memory and accessible as normal RAM via an NVLink2 bridge. When
> passed through in a guest, QEMU puts the NVIDIA RAM window in a
> non-contiguous area, above the PCI MMIO area that starts at 32TiB.
> This means that the NVIDIA RAM window starts at 64TiB and goes all
> the way to 128TiB.
>
> This means that the guest might request a 64-bit window, for each PCI
> Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM
> window isn't counted as regular RAM, thus this window is considered
> only for the allocation of the Translation Control Entry (TCE) table.
>
> This memory layout differs from the existing VFIO case, requiring its
> own formula. This patch changes the PPC64 code of
> @qemuDomainGetMemLockLimitBytes to:
>
> - detect if we have a NVLink2 bridge being passed through to the
> guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function
> added in the previous patch. The existence of the NVLink2 bridge in
> the guest means that we are dealing with the NVLink2 memory layout;
>
> - if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
> different way to account for the extra memory the TCE table can alloc.
> The 64TiB..128TiB window is more than enough to fit all possible
> GPUs, thus the memLimit is the same regardless of passing through 1 or
> multiple V100 GPUs.
>
> Signed-off-by: Daniel Henrique Barboza <danielhb413 at gmail.com>
> ---
> src/qemu/qemu_domain.c | 42 ++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> index dcc92d253c..6d1a69491d 100644
> --- a/src/qemu/qemu_domain.c
> +++ b/src/qemu/qemu_domain.c
> @@ -10443,7 +10443,10 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
> unsigned long long maxMemory = 0;
> unsigned long long passthroughLimit = 0;
> size_t i, nPCIHostBridges = 0;
> + virPCIDeviceAddressPtr pciAddr;
> + char *pciAddrStr = NULL;
> bool usesVFIO = false;
> + bool nvlink2Capable = false;
>
> for (i = 0; i < def->ncontrollers; i++) {
> virDomainControllerDefPtr cont = def->controllers[i];
> @@ -10461,7 +10464,16 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
> dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
> dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
> usesVFIO = true;
> - break;
> +
> + pciAddr = &dev->source.subsys.u.pci.addr;
> + if (virPCIDeviceAddressIsValid(pciAddr, false)) {
> + pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
Again this leaks the PCI address string on every iteration and on exit
from this function.
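Something along these lines would fix both leaks (untested sketch;
assumes VIR_AUTOFREE is fine to use here, which frees the string
automatically at the end of every iteration and on any early exit):

    pciAddr = &dev->source.subsys.u.pci.addr;
    if (virPCIDeviceAddressIsValid(pciAddr, false)) {
        VIR_AUTOFREE(char *) pciAddrStr = NULL;

        if (!(pciAddrStr = virPCIDeviceAddressAsString(pciAddr)))
            break; /* OOM was already reported, fall back to plain VFIO */

        if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
            nvlink2Capable = true;
            break;
        }
    }

That also lets you drop the function-scope 'pciAddrStr' declaration.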
> + if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
> + nvlink2Capable = true;
> + break;
> + }
> + }
> +
> }
> }
>
> @@ -10488,6 +10500,32 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
> 4096 * nPCIHostBridges +
> 8192;
>
> + /* NVLink2 support in QEMU is a special case of the passthrough
> + * mechanics explained in the usesVFIO case below. The GPU RAM
> + * is placed with a gap after maxMemory. The current QEMU
> + * implementation puts the NVIDIA RAM above the PCI MMIO, which
> + * starts at 32TiB and is the MMIO reserved for the guest main RAM.
> + *
> + * This window ends at 64TiB, and this is where the GPUs are being
> + * placed. The next available window size is 128TiB, and
> + * 64TiB..128TiB will fit all possible NVIDIA GPUs.
> + *
> + * The same assumption as the most common case applies here:
> + * the guest will request a 64-bit DMA window, per PHB, that is
> + * big enough to map all its RAM, which is now at 128TiB due
> + * to the GPUs.
> + *
> + * Note that the NVIDIA RAM window must be accounted for in the TCE
> + * table size, but *not* in the main RAM (maxMemory). This gives
> + * us the following passthroughLimit for the NVLink2 case:
Citation needed. Please link a source for these claims. We have some
sources for claims on x86_64 even if they are not exactly scientific.
> + *
> + * passthroughLimit = maxMemory +
> + * 128TiB/512KiB * #PHBs + 8 MiB */
> + if (nvlink2Capable)
Please add curly braces to this condition since it's multi-line and
also has a big comment inside it (sketch below).
> + passthroughLimit = maxMemory +
> + 128 * (1ULL<<30) / 512 * nPCIHostBridges +
> + 8192;
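With braces it would look like:

    if (nvlink2Capable) {
        passthroughLimit = maxMemory +
                           128 * (1ULL << 30) / 512 * nPCIHostBridges +
                           8192;
    }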
I don't quite understand why this formula uses maxMemory (the memory
hotplug maximum, when one is configured) while the vfio case below
uses just 'memory'.
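Just to spell the formula out with numbers (assuming the values in
this function are in KiB, which is what '128 * (1ULL << 30)' == 128TiB
suggests): with 2 PHBs and a 32GiB maxMemory this gives

    passthroughLimit = 32GiB + 2 * (128TiB / 512) + 8MiB
                     = 32GiB + 2 * 256GiB + 8MiB
                    ~= 544GiB

so the fixed 256GiB of TCE overhead per PHB dominates the guest
memory size.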
> +
> /* passthroughLimit := max( 2 GiB * #PHBs, (c)
> * memory (d)
> * + memory * 1/512 * #PHBs + 8 MiB ) (e)
> @@ -10507,7 +10545,7 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
> * kiB pages, less still if the guest is mapped with hugepages (unlike
> * the default 32-bit DMA window, DDW windows can use large IOMMU
> * pages). 8 MiB is for second and further level overheads, like (b) */
> - if (usesVFIO)
> + else if (usesVFIO)
So can't there be a case where an NVLink2 device is present alongside
other VFIO devices, e.g. VFIO network cards?
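If both can appear at the same time I'd naively expect the limit to be
the larger of the two formulas, something like this (hypothetical
sketch, the two intermediate variables are made up):

    unsigned long long nvlinkLimit = maxMemory +
                                     128 * (1ULL << 30) / 512 * nPCIHostBridges +
                                     8192;
    unsigned long long vfioLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                       memory +
                                       memory / 512 * nPCIHostBridges + 8192);

    passthroughLimit = MAX(nvlinkLimit, vfioLimit);

unless the NVLink2 formula is guaranteed to be the bigger one whenever
both device types are present, in which case a comment saying so would
help.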
> passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
> memory +
> memory / 512 * nPCIHostBridges + 8192);
Also add curly braces here while you are at it.
> --
> 2.20.1
>