[vfio-users] Linux iommu issue with peer-to-peer dma transfers between NVidia GTX 1080s

Ilias Kasoumis ilias.kasoumis at gmail.com
Sun Sep 24 09:26:27 UTC 2017


Thanks for the input on this - I am surprised to hear that Quadro cards and
the K4000 exhibit the same issue on bare metal! This means I need to
revise my expectations. I will look at your suggestion involving
virtual cliques and see where it takes me.
On Sat, 2017-09-23 at 11:26 -0600, Alex Williamson wrote:
> On Sat, 23 Sep 2017 17:00:37 +0100
> Ilias Kasoumis <ilias.kasoumis at gmail.com> wrote:
> 
> > Hi, 
> > I would like to draw upon the list participants' know-how and
> > experience in trying to resolve the following issue. I have tried in
> > vain to get NVidia's support in the past; I gave up for quite a long
> > time in the hope it would get fixed as a matter of course, but coming
> > back to it half a year later (and multiple kernel and driver versions
> > later) I see it still persists. (The original post is at
> > https://devtalk.nvidia.com/default/topic/996091/peer-to-peer-dma-issue-/
> > and I am copying it below.)
> > 
> > The bug makes the use of multiple GTX 1080s impossible when I turn on
> > the IOMMU in Linux (tried kernels 4.8 and 4.13, using either standard
> > iommu=on, iommu=on,igfx_off, or iommu=pt for passthrough mode) on an
> > X99 board.
> > 
> > The bug can be triggered by running any peer-to-peer memory transfer;
> > for example, running the CUDA 8.0 Samples code
> > 1_Utilities/p2pBandwidthLatencyTest from the terminal triggers the
> > problem: the video driver (and as a result the X server) crashes
> > immediately, and after multiple Ctrl-C's and tens of seconds of
> > waiting the server eventually restarts and I am presented with an X
> > login prompt.
> > 
> > The relevant kernel error messages are (there are thousands of these
> > lines; just a snippet below):
> > 
> > [   51.691440] DMAR: DRHD: handling fault status reg 2
> > [   51.691450] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
> > [   51.691457] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
> > [   51.691462] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
> > [   51.691465] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
> > [   51.691470] DMAR: DRHD: handling fault status reg 400
> > [   51.740674] DMAR: DRHD: handling fault status reg 402
> > [   51.740683] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
> > [   51.740688] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
> > [   51.740693] DMAR: [DMA Write] Request device [04:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
> > 
> > Clearly, the above suggests that the CUDA driver is attempting DMA at
> > an address for which the corresponding IOMMU page table entry's write
> > flag is not set, presumably because the driver has not properly
> > registered/requested access via the general dma_map() kernel
> > interface
> > (https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt). 
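The DMAR fault messages above follow a fixed format, and with thousands of lines it helps to group them. A small sketch (a hypothetical helper, not part of any tool mentioned in the thread) that summarizes such dmesg output by faulting device, address, and reason:

```python
import re
from collections import Counter

# Matches DMAR fault lines of the form shown above, e.g.
# "[   51.691450] DMAR: [DMA Write] Request device [04:00.0] fault addr
#  f8139000 [fault reason 05] PTE Write access is not set"
FAULT_RE = re.compile(
    r"DMAR: \[(?P<kind>DMA \w+)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+) \[fault reason (?P<reason>\d+)\]"
)

def summarize(dmesg_lines):
    """Count DMAR faults per (device, address, reason) tuple."""
    counts = Counter()
    for line in dmesg_lines:
        m = FAULT_RE.search(line)
        if m:
            counts[(m["dev"], m["addr"], int(m["reason"]))] += 1
    return counts
```

Feeding the snippet above through this collapses the flood into a single `('04:00.0', 'f8139000', 5)` entry with a repeat count, which makes it easy to see that all faults hit one device and one address.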
> > 
> > Scouting the net reveals a bug report
> > (https://bugzilla.kernel.org/show_bug.cgi?id=188271) filed for
> > exactly the same reason on totally different hardware (a Supermicro
> > dual-socket board) using Pascal Titan X's, i.e. cards of the same
> > architecture as mine. Interestingly enough, the kernel error messages
> > in that report claim unauthorized access of *exactly* the same memory
> > address (f8139000):
> > 
> > [16193.666976] DMAR: [DMA Write] Request device [82:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set
> > 
> > So this looks like a red flag that somehow the indirection afforded
> > by the IOMMU is bypassed and the driver is using hardcoded DMA
> > addresses. Please note that the author of the bug report claims that
> > setting iommu=igfx_off somehow solves this, but really igfx_off per
> > se should be irrelevant here without turning IOMMU support on first,
> > with something like iommu=on,igfx_off. What most likely happens
> > instead is that iommu=igfx_off, as opposed to iommu=on, just turns
> > the IOMMU off altogether, allowing the DMA to succeed. This is
> > exactly what happens on my system too. So in other words, the bug
> > report merely states that turning off the IOMMU allows peer-to-peer
> > transfers to work. Still, his detailed log files should be very
> > useful as an independent manifestation of the same issue. My log
> > files are attached to the original thread linked at the start of this
> > post.
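For reference, the boot-parameter variants discussed above are set on the kernel command line, e.g. via GRUB. A sketch assuming the usual Ubuntu layout (file path and the documented intel_iommu= spelling of the Intel-specific options are assumptions, not quotes from the thread):

```shell
# /etc/default/grub - pick ONE variant, then run update-grub and reboot
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"            # full IOMMU translation
#GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on,igfx_off"  # translation, but skip integrated graphics
#GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt"  # passthrough (identity) mapping
```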
> > 
> > I am using an ASRock X99 board (X99E-ITX/ac) with the latest
> > firmware, an Intel i7-6800K, dual Asus GTX 1080 Founders Edition
> > cards, 32GB RAM and Ubuntu 16.10 (or 17.10 now) with all updates
> > applied (kernel 4.8.0-37 or 4.13 now) with driver 378.13 or 384.69.
> > 
> > Have you come across this while trying to virtualize NVidia GPUs?
> > Given that the Linux driver forum at NVidia refuses to display bug
> > posts by users (they remain "hidden"), and given that NVidia would
> > much rather have you buy Quadros and Teslas instead, the conspiracy
> > theorist in me is inclined to believe that VT-d is intentionally
> > disabled in consumer versions of the hardware...
> > 
> > Thanks for any input/solutions! 
> 
> IME, supported Quadros fail in the same way on a bare metal host with
> the IOMMU enabled when running p2pBandwidthLatencyTest from the CUDA
> tests; a K4000 did the same thing for me.  Also note that the
> igfx_off option is specifically an intel_iommu parameter and, IIRC,
> only changes the behavior of integrated ('i' in igfx) graphics.  As
> you're on X99, this option is irrelevant.  I haven't investigated why
> iommu=pt doesn't work here; X99 should have hardware passthrough
> support in the DRHD, but maybe it doesn't work for p2p.
> 
> Since you're asking vfio-users about this bare metal iommu issue, let
> me also note this QEMU patch series:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg05826.html
> 
> I have no idea if NVIDIA enables GPUDirect on GeForce cards, but you
> might actually be able to do what you're looking for within a VM,
> since vfio will map all memory and MMIO through the IOMMU.  These
> mappings are transparent to the guest kernel and userspace, so it
> just works.
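A minimal sketch of what doing this "within a VM" could look like, assuming both GPUs are already bound to vfio-pci; the PCI addresses, guest sizing, and disk image name are placeholders, not values from this thread:

```shell
# Hand both GPUs (addresses assumed) to one KVM guest via vfio-pci;
# vfio then maps all guest memory and MMIO through the IOMMU.
qemu-system-x86_64 \
  -machine q35,accel=kvm -cpu host -smp 8 -m 16G \
  -device vfio-pci,host=03:00.0 \
  -device vfio-pci,host=04:00.0 \
  -drive file=guest.qcow2,if=virtio
```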
> 
> Perhaps NVIDIA hasn't added DMA-API support to their driver for these
> use cases simply because of the iommu overhead.  If devices are
> operating in a virtual address space (iova), all transactions need to
> pass through the iommu for translation.  In order to get p2p directly
> through switches downstream in the topology, the switch needs to
> support ACS Direct Translation and the endpoints need to support
> Address Translation Services (ATS).  NVIDIA devices do not support
> the latter, and ACS Direct Translation is a mostly unexplored space.
> Since you're using 1080s, which have only a single GPU per card,
> switches are probably not involved unless they're built into your
> motherboard.  Thanks,
> 
> Alex
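Whether a given switch port advertises the ACS features Alex mentions can be checked with `sudo lspci -vvv` (look for an "Access Control Services" capability). The low bits of the ACS Capability register decode as follows, per the PCIe spec's bit layout; a small illustrative helper (names match lspci's abbreviations):

```python
# ACS Capability register bits (PCIe Access Control Services Extended
# Capability). Bit 6, Direct Translated P2P, is the "ACS DT" feature
# discussed above.
ACS_BITS = [
    (0, "SrcValid"),     # ACS Source Validation
    (1, "TransBlk"),     # ACS Translation Blocking
    (2, "ReqRedir"),     # ACS P2P Request Redirect
    (3, "CmpltRedir"),   # ACS P2P Completion Redirect
    (4, "UpstreamFwd"),  # ACS Upstream Forwarding
    (5, "EgressCtrl"),   # ACS P2P Egress Control
    (6, "DirectTrans"),  # ACS Direct Translated P2P
]

def decode_acs(cap):
    """Return the set of ACS feature names advertised in a raw capability value."""
    return {name for bit, name in ACS_BITS if cap & (1 << bit)}
```

For example, a port advertising only Direct Translated P2P would report a raw value of 0x40, which decodes to `{"DirectTrans"}`.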
