[vfio-users] Stability issues with GTX 970 and GTX 660, Nvidia driver crashes often, seemingly under load

Brian Yglesias brian at atlanticdigitalsolutions.com
Thu Jul 7 21:10:13 UTC 2016


>
>> I've been trying to get GPU passthrough to work more reliably for a few
>> days.
>>
>> I have an Asus Rampage III Formula (X58 chipset, LGA1366) with the latest
>> bios, Xeon X5670, kernel 4.4.13, qemu 2.5.1.1.  I'm passing through a GTX
>> 660 and a GTX 970, sometimes to two different VMs, and sometimes to the
>> same one.
>>
>>
> i have a gtx970 and it works pretty well for gpu passthru.
> but i'm not so sure a 660 will work and i suspect you will have reset
> issues.
>

I'm not sure what you mean by reset issues.  I see them referred to mostly in the context of AMD cards, so I haven't been paying much attention to that.  I've had issues with VMs not soft-resetting correctly and needing to be hard reset, and with the driver crashing and recovering.  Rebooting the host does not seem to help anything.

>
>Seems to be some growing FUD with nvidia and reset issues.  AFAIK, there
>are no reset issues for Kepler and newer cards, including the 660.  Fermi
>cards always seem to cause problems, but I don't necessarily think it's
>reset related.  Reset problems on nvidia are more likely a result of trying
>to assign the primary host graphics or getting the card into a bad state
>with host graphics drivers.  I have a GTX660, it doesn't get used often for
>this purpose but IIRC, it works just fine.

To try to eliminate the possibility of a host driver causing the issue, I did the following:

- Added the following to /etc/default/grub (the part at the end, rd.driver.pre=vfio-pci, is the new bit):
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts rd.driver.pre=vfio-pci"
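After rebooting with the new command line, it's worth confirming the kernel actually picked it up and built IOMMU groups. A quick sanity check (sysfs path per the standard kernel layout):

```shell
# Did the kernel see intel_iommu=on, and did it create any IOMMU groups?
cmdline=$(cat /proc/cmdline)
groups=$(ls /sys/kernel/iommu_groups 2>/dev/null | wc -l)
echo "kernel cmdline: $cmdline"
echo "IOMMU groups: $groups"
```

A zero group count means the IOMMU never came up and vfio-pci can't isolate anything.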

- Also, in /etc/modprobe.d/vfio_pci.conf I now try to reserve all the nvidia cards, leaving none for the host:
options vfio-pci disable_vga=1
options vfio-pci ids=10de:13c2,10de:0fbb,10de:11c0,10de:0e0b
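Those vendor:device ids come from lspci -nn; to re-check which id belongs to which function, and which driver ended up bound to each, something like this works:

```shell
# List every NVIDIA (vendor 10de) PCI function with its [vendor:device] id
# and the kernel driver currently bound to it.  After a correct setup each
# passed-through function should report "Kernel driver in use: vfio-pci".
nvidia_vendor=10de
lspci -nnk -d "${nvidia_vendor}:" 2>/dev/null || echo "lspci not available"
```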

- I blacklisted the nouveau and nvidia modules too, for good measure:
# cat /etc/modprobe.d/blacklist.conf
blacklist nouveau
blacklist nvidia

... now the devices on buses 3, 4, and 5 display "Kernel driver in use: vfio-pci" in lspci
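lspci only shows what's bound to the devices; to be sure neither blacklisted driver sneaked into the running kernel at all, the module list can be checked directly:

```shell
# Count loaded nouveau/nvidia modules; 0 means the blacklist worked.
loaded=$(lsmod 2>/dev/null | grep -c -E '^(nouveau|nvidia)' || true)
echo "nouveau/nvidia modules loaded: $loaded"
```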


- And I added the following to /etc/initramfs-tools/modules:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
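Assuming initramfs-tools (Debian/Ubuntu), whether those modules actually made it into the rebuilt image can be verified with lsinitramfs rather than trusting the rebuild:

```shell
# Count vfio-related files packed into the current initrd
# (lsinitramfs ships with initramfs-tools).
initrd="/boot/initrd.img-$(uname -r)"
if [ -r "$initrd" ]; then
  lsinitramfs "$initrd" | grep -c vfio
else
  echo "initrd not readable at $initrd"
fi
```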


- Then I rebuilt my initrd (to rule out a mistake on my part) by apt-get removing and reinstalling my kernel.

-update-grub

-reboot
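For what it's worth, reinstalling the kernel package isn't necessary for future rebuilds; initramfs-tools can regenerate the initrd directly (requires root, shown here as a sketch only):

```shell
# Regenerate the initrd for the running kernel and refresh the grub
# config, without touching the kernel package itself.
update-initramfs -u -k "$(uname -r)"
update-grub
```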



For lack of a better test, I started Dota on the VM with the 660 and ran a "demo match".  I then started the new DOOM (incidentally purchased just for this test) on the 970.  Both games were set to max settings; DOOM was at 4K.

I was getting about 40 fps in DOOM while the Dota demo match ran on the other VM.  Then the 970's driver crashed at the last stage of loading DOOM's first level, and as I shut down that VM, the 660's driver running Dota crashed as well.  Less than a minute passed between the first signs of trouble and both VMs' GPU drivers failing.

The audio popped and stuttered periodically on the DOOM side, enough to be noticeable, before the crash.  However, before I enabled MSI in the guest, even the "bong" of adjusting the volume in the Windows guest would stutter and lag, with no 3D-accelerated apps running and only one VM up.  I feel like that should have worked irrespective of the interrupt mode.
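Whether the guest really flipped the card into MSI mode is visible from the host side, incidentally; "03:00.0" below is a placeholder for the passed-through GPU's actual PCI address:

```shell
# "MSI: Enable+" in the capability line means message-signaled interrupts
# are active for the function; "Enable-" means it is still on line-based
# interrupts.  addr is a placeholder; substitute the real GPU address.
addr="03:00.0"
lspci -vs "$addr" 2>/dev/null | grep 'MSI:' || echo "no MSI line for $addr"
```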

Perhaps I'm still not fully sequestering the GPUs from the host?
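One way to check is to walk the IOMMU groups: every device in a group that gets passed through must be bound to vfio-pci (PCIe bridges excepted), or the host still has a hand on it. A rough sketch:

```shell
# Print each device in each IOMMU group together with its bound driver.
found=0
for dev in /sys/kernel/iommu_groups/*/devices/*; do
  [ -e "$dev" ] || continue
  group=${dev%/devices/*}
  drv="(none)"
  [ -e "$dev/driver" ] && drv=$(basename "$(readlink "$dev/driver")")
  echo "group ${group##*/}: device ${dev##*/} driver=$drv"
  found=$((found + 1))
done
echo "devices inspected: $found"
```

Any group containing both a passed-through GPU and a device still claimed by a host driver is a candidate for exactly this kind of instability.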

In the next days I'll try to be more methodical, in terms of trying all of the simplest hardware combinations and recreating the VMs from scratch.  I didn't really start testing this in earnest until I got the 970, but now that I think back, it seemed to work better with just the 660s.  That's another thing I can try.  I'm sure something else will come to mind, but for now that is all I can think of.

Thanks again for the responses.



