[vfio-users] Linux iommu issue with peer-to-peer dma transfers between NVidia GTX 1080s

William Davis wdavis at nvidia.com
Mon Sep 25 15:26:56 UTC 2017


Hi Ilias, Alex,

> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson at redhat.com]
> Sent: Saturday, September 23, 2017 12:26 PM
> To: Ilias Kasoumis <ilias.kasoumis at gmail.com>
> Cc: vfio-users at redhat.com; William Davis <wdavis at nvidia.com>
> Subject: Re: [vfio-users] Linux iommu issue with peer-to-peer dma transfers
> between NVidia GTX 1080s
> 
> On Sat, 23 Sep 2017 17:00:37 +0100
> Ilias Kasoumis <ilias.kasoumis at gmail.com> wrote:
> 
> > Hi,
> > I would like to draw upon the list participants' know-how and
> > experience in trying to resolve the following issue. I tried in vain to
> > get NVidia's support in the past and gave up for quite a long time in
> > the hope it would get fixed as a matter of course, but coming back to it
> > half a year later (and multiple kernel and driver versions later) I see
> > it still persists. (The original post is at
> > https://devtalk.nvidia.com/default/topic/996091/peer-to-peer-dma-issue-/
> > and I am copying it below.)
> >
> >
> > The bug makes the use of multiple GTX 1080s impossible when I turn
> > on the IOMMU in Linux (tried kernels 4.8 and 4.13, using either
> > standard iommu=on, iommu=on,igfx_off, or iommu=pt for passthrough
> > mode) on an X99 board.
> >
> > The bug can be triggered by running any peer-to-peer memory transfer;
> > for example, running the CUDA 8.0 Samples code
> > 1_Utilities/p2pBandwidthLatencyTest from the terminal triggers the
> > problem: the video driver (and as a result the X server) crashes
> > immediately, and after multiple Ctrl-C's and waiting for tens of
> > seconds the server eventually restarts and I am presented with a login
> > prompt to X Windows.
> >
> > The relevant kernel error messages look like this (there are thousands
> > of these lines; just a snippet below):
> >
> > [   51.691440] DMAR: DRHD: handling fault status reg 2
> > [   51.691450] DMAR: [DMA Write] Request device [04:00.0] fault addr
> > f8139000 [fault reason 05] PTE Write access is not set
> > [   51.691457] DMAR: [DMA Write] Request device [04:00.0] fault addr
> > f8139000 [fault reason 05] PTE Write access is not set
> > [   51.691462] DMAR: [DMA Write] Request device [04:00.0] fault addr
> > f8139000 [fault reason 05] PTE Write access is not set
> > [   51.691465] DMAR: [DMA Write] Request device [04:00.0] fault addr
> > f8139000 [fault reason 05] PTE Write access is not set
> > [   51.691470] DMAR: DRHD: handling fault status reg 400
> > [   51.740674] DMAR: DRHD: handling fault status reg 402
> > [   51.740683] DMAR: [DMA Write] Request device [04:00.0] fault addr
> > f8139000 [fault reason 05] PTE Write access is not set
> > [   51.740688] DMAR: [DMA Write] Request device [04:00.0] fault addr
> > f8139000 [fault reason 05] PTE Write access is not set
> > [   51.740693] DMAR: [DMA Write] Request device [04:00.0] fault addr
> > f8139000 [fault reason 05] PTE Write access is not set
> >
> > Clearly the above suggests that the CUDA driver is attempting DMA at an
> > address for which the corresponding iommu page table entry write flag
> > is not set, presumably because the driver has not properly
> > registered/requested access via the general dma_map() kernel interface
> > (https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt)
> >
> > Scouting the net reveals a bug report
> > (https://bugzilla.kernel.org/show_bug.cgi?id=188271) filed for exactly
> > the same reason on totally different hardware (a Supermicro dual-socket
> > board) using Pascal Titan Xs, i.e. cards of the same architecture as
> > mine. Interestingly enough, the kernel error messages in that report
> > show an unauthorized access to *exactly* the same memory address
> > (f8139000; see below):
> >
> > [16193.666976] DMAR: [DMA Write] Request device [82:00.0] fault addr
> > f8139000 [fault reason 05] PTE Write access is not set
> >
> > So this looks like a red flag that somehow the indirection afforded by
> > the iommu is bypassed and the driver is using hardcoded DMA addresses.

The reason you see the same address here is that NVIDIA GPUDirect uses certain BAR0 registers to set up and synchronize peer-to-peer transfers. Those registers start at offset 0x139000 of the target device's BAR0, so the only coincidence here is that the BAR0 of the target device on both systems happens to be at 0xf8000000.
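
To make the arithmetic explicit, here is a trivial sketch (mine, not driver code) using the standard pci_resource_start() helper; the function name is made up for illustration:

/*
 * Illustration only (not NVIDIA driver code): the faulting address is just
 * the target GPU's BAR0 base plus the fixed register offset, e.g.
 *
 *   0xf8000000 (BAR0 base) + 0x139000 (register offset) = 0xf8139000
 */
#include <linux/pci.h>

static void show_expected_p2p_target(struct pci_dev *target_gpu)
{
	resource_size_t bar0 = pci_resource_start(target_gpu, 0);

	pr_info("peer writes will target 0x%llx (BAR0 0x%llx + 0x139000)\n",
		(unsigned long long)(bar0 + 0x139000),
		(unsigned long long)bar0);
}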

More below on why this doesn't look like an iommu-mapped address.

> > Please note that the author of the bug report claims that setting
> > iommu=igfx_off somehow solves this, but really igfx_off per se should
> > be irrelevant here without turning the iommu support on first, with
> > something like iommu=on,igfx_off. What instead happens is that most
> > likely iommu=igfx_off as opposed to iommu=on just turns off iommu
> > altogether, allowing the dma to succeed. This is exactly what happens
> > on my system too. So in other words the bug report merely states that
> > turning off the iommu allows peer-to-peer transfers to work. Still, his
> > detailed log files should be very useful for an independent
> > manifestation of the same issue. My log files are attached on the
> > original thread included at the start of this post.
> >
> > I am using an ASRock X99 board (x99e-itx/ac) with latest firmware,
> > an Intel i6800k, dual Asus GTX 1080 Founders Edition cards, 32 GB RAM, and
> > Ubuntu 16.10 (or 17.10 now) with all updates applied (kernel 4.8.0-37
> > or 4.13 now) with driver 378.13 or 384.69.
> >
> > Have you come across this while trying to virtualize nvidia GPUs? Given
> > that the Linux driver forum at nvidia refuses to display bug posts by
> > users (they remain "hidden"), and given that nvidia would much rather
> > have you buy Quadros and Teslas instead, the conspiracy theorist in me
> > is more
> > inclined to believe that vt-d is intentionally disabled in consumer
> > versions of the hardware...

I can assure you that the NVIDIA driver does not withhold support for iommu-enabled configurations to implement product segmentation.

> >
> > Thanks for any input/solutions!
> 
> IME, supported Quadro cards fail in the same way on a bare metal host with
> iommu enabled running the p2pBandwidthLatencyTest from the cuda tests.
> K4000 did the same thing for me.  Also note that the igfx_off option is
> specifically an intel_iommu parameter and IIRC, only changes the
> behavior of integrated ('i' in igfx) graphics.  As you're on X99, this
> option is irrelevant.  I haven't investigated why iommu=pt doesn't work
> here; X99 should have hardware passthrough support in the DRHD, but
> maybe it doesn't work for p2p.
> 
> Since you're asking vfio-users about this bare metal iommu issue, let
> me also note this QEMU patch series:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg05826.html
> 
> I have no idea if NVIDIA enables GPUDirect on GeForce cards, but you
> might actually be able to do what you're looking for within a VM since
> vfio will map all memory and mmio through the iommu.  These mappings
> are transparent for the guest kernel and userspace, so it just works.
> 
> Perhaps NVIDIA hasn't added DMA-API support to their driver for these
> use cases simply because of the iommu overhead.  If devices are
> operating in a virtual address space (iova), all transactions need to
> pass through the iommu for translation. 

Yes, there certainly is an overhead associated with mapping peer memory through the iommu, although "90% performance" (or however much it ends up being - I believe this varies depending on the hardware/use case) is still much better than "broken"; we would use the DMA-API for this if we could.

The main reason we do not call the kernel DMA-API for mapping peer device BARs for access by other devices is that the kernel has not provided an API to do so. The iommu map_* API callbacks (with one exception mentioned below) assume that the memory being mapped is backed by a struct page, which PCI device BARs are not [1].
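
To make that concrete, here is a minimal sketch of the mismatch (my illustration, not driver code): system RAM can go through dma_map_page() because every RAM page has a struct page, while a peer's BAR is only a physical address range:

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Fine: system RAM, backed by a struct page, can be mapped for DMA. */
static dma_addr_t map_ram_page(struct device *dev, struct page *pg)
{
	return dma_map_page(dev, pg, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
}

/*
 * Not fine: a peer device's BAR has no struct page behind it, so there is
 * nothing legitimate to pass to dma_map_page().  Conjuring one up via
 * pfn_to_page() is exactly the trick footnote [1] describes, and not
 * something we considered safe to rely on.
 */
static dma_addr_t map_peer_bar_page(struct device *dev, struct pci_dev *peer)
{
	phys_addr_t bar0 = pci_resource_start(peer, 0);
	struct page *bogus = pfn_to_page(bar0 >> PAGE_SHIFT);	/* invalid */

	return dma_map_page(dev, bogus, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
}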

I attempted to add a DMA-API op for this a couple of years ago [2], but was unable to provide an in-tree user for the API at the time, so I dropped the patch series. Happily, someone recently pointed out to me that a dma_map_resource() API was added about a year ago [3], apparently for some ARM drivers, and that is exactly what the NVIDIA driver would need to use to support GPUDirect through an iommu. Less happily, that API currently appears to be a no-op on most platforms on which NVIDIA GPUs are supported, including Intel x86, so this wouldn't help with your immediate problem.
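
For reference, here is roughly what that could look like on a platform where dma_map_resource() is actually wired up; this is a hypothetical sketch (names and error handling are mine), not something from our driver:

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/*
 * Hypothetical sketch: map a peer device's BAR for DMA by another device
 * using dma_map_resource() [3].  As noted above, on Intel x86 this
 * currently appears to be a no-op, so the address comes back untranslated.
 */
static dma_addr_t map_peer_bar(struct device *initiator,
			       struct pci_dev *target, int bar, size_t len)
{
	phys_addr_t phys = pci_resource_start(target, bar);
	dma_addr_t iova;

	iova = dma_map_resource(initiator, phys, len, DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(initiator, iova))
		return 0;	/* treat 0 as "mapping failed" in this sketch */

	/*
	 * 'iova' is what the initiating device would program into its DMA
	 * engine instead of the raw BAR physical address; tear the mapping
	 * down later with dma_unmap_resource().
	 */
	return iova;
}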

Alex's recommendation of trying to do this within a VM seems like the quickest path to getting something working on your end, since the mmio regions are already mapped through the iommu there. The referenced QEMU patch (or something similar) is needed in that case because the VMM flattens out/obfuscates the PCI topology, and the NVIDIA driver will disallow GPUDirect without additional information from the VMM about peer communicability, which we try to determine in advance of attempting peer-to-peer transfers that could otherwise hang or crash the host.

> In order to get p2p
> directly through switches downstream in the topology, the switch needs to
> support ACS Direct Translation and the endpoints need to support
> Address Translation Services (ATS).  NVIDIA devices do not support the
> latter and ACS DT is a mostly unexplored space.  Since you're
> using 1080s which only have a single GPU per card, switches are
> maybe not involved unless they're built into your motherboard.  Thanks,
> 
> Alex

[1] With at least one of the iommu drivers, IIRC you could trick the map_single API into mapping peer BAR memory, because it passed a struct page pointer around to determine the corresponding physical address but never actually dereferenced it. We did not feel it was safe to rely on this behavior in the NVIDIA kernel driver.

[2] https://lists.linuxfoundation.org/pipermail/iommu/2015-September/014250.html

[3] https://github.com/torvalds/linux/commit/6f3d87968f9c8b529bc81eff5a1f45e92553493d

--Will

--
nvpublic



