[vfio-users] Stability issues with GTX 970 and GTX 660, Nvidia driver crashes often, seemingly under load

Brian Yglesias brian at atlanticdigitalsolutions.com
Thu Jul 7 18:01:52 UTC 2016


I've been trying to get GPU passthrough to work more reliably for a few days.

I have an Asus Rampage III Formula (X58 chipset, LGA1366) with the latest BIOS, a Xeon X5670, kernel 4.4.13, and QEMU 2.5.1.1.  I'm passing through a GTX 660 and a GTX 970, sometimes to two different VMs, and sometimes to the same one.

The invocation of kvm I'm having the most luck with is:

/usr/bin/kvm \
-id 110 \
-chardev socket,id=qmp,path=/var/run/qemu-server/110.qmp,server,nowait \
-mon chardev=qmp,mode=control \
-pidfile /var/run/qemu-server/110.pid \
-daemonize \
-smbios type=1,uuid=aecb408f-89ef-44ef-9a7a-a7fa9d6f75f8 \
-drive if=pflash,format=raw,readonly,file=/usr/share/kvm/OVMF-pure-efi.fd \
-drive if=pflash,format=raw,file=/tmp/110-OVMF_VARS-pure-efi.fd \
-name Brian-PC \
-smp 8,sockets=1,cores=8,maxcpus=8 \
-nodefaults \
-boot menu=on,strict=on,reboot-timeout=1000 \
-vga none \
-nographic \
-no-hpet \
-cpu host,+kvm_pv_unhalt,+kvm_pv_eoi,kvm=off \
-m 8196 \
-k en-us \
-readconfig /usr/share/qemu-server/pve-q35.cfg \
-device usb-tablet,id=tablet,bus=ehci.0,port=1 \
-device vfio-pci,host=04:00.0,id=hostpci0,bus=ich9-pcie-port-1,addr=0x0 \
-device vfio-pci,host=04:00.1,id=hostpci1,bus=ich9-pcie-port-2,addr=0x0 \
-device vfio-pci,host=05:00.0,id=hostpci2,bus=ich9-pcie-port-3,addr=0x0 \
-device vfio-pci,host=05:00.1,id=hostpci3,bus=ich9-pcie-port-4,addr=0x0 \
-device usb-host,hostbus=1,hostport=6.1,id=usb0 \
-device usb-host,hostbus=1,hostport=6.2,id=usb1 \
-device usb-host,hostbus=1,hostport=6.3,id=usb2 \
-device usb-host,hostbus=1,hostport=6.4,id=usb3 \
-device usb-host,hostbus=1,hostport=6.5,id=usb4 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 \
-iscsi initiator-name=iqn.1993-08.org.debian:01:3f1e9afe6fdb \
-drive file=/dev/zvol/rpool/data/vm-110-disk-1,if=none,id=drive-virtio0,cache=writeback,format=raw,aio=threads,detect-zeroes=on \
-device virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100 \
-drive file=/dev/zvol/tank0/vm-110-disk-1,if=none,id=drive-virtio1,cache=writeback,format=raw,aio=threads,detect-zeroes=on \
-device virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb \
-netdev type=tap,id=net0,ifname=tap110i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on \
-device virtio-net-pci,mac=62:63:65:65:32:31,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300 \
-rtc driftfix=slew,base=localtime \
-machine type=q35 \
-global kvm-pit.lost_tick_policy=discard





However, I've tried with and without the hv_* flags (my distro wants to leave them in and set hv_vendor_id=proxmox), and with hv_vendor_id left unset, set to an arbitrary string, and set to Nvidia43FIX.
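To make those variants concrete, here is a minimal sketch of how the -cpu argument changes between them. build_cpu_arg is a hypothetical helper (not part of QEMU or Proxmox), and the hv_relaxed,hv_vapic,hv_time set is only an assumed example of Hyper-V flags, not the exact list my distro adds.

```shell
#!/bin/sh
# Sketch: compose the value passed to -cpu for each variant tried.
# build_cpu_arg is a hypothetical helper, not part of QEMU or Proxmox.
#   $1 = extra Hyper-V flags (comma separated, may be empty)
#   $2 = value for hv_vendor_id (may be empty)
build_cpu_arg() {
    hv_flags="$1"
    vendor_id="$2"
    # Base flags taken from the kvm invocation in this post.
    arg="host,+kvm_pv_unhalt,+kvm_pv_eoi,kvm=off"
    if [ -n "$hv_flags" ]; then
        arg="$arg,$hv_flags"
    fi
    if [ -n "$vendor_id" ]; then
        arg="$arg,hv_vendor_id=$vendor_id"
    fi
    printf '%s\n' "$arg"
}

# No Hyper-V flags at all (the invocation shown above).
build_cpu_arg "" ""
# Assumed example flag set with the vendor ID overridden.
build_cpu_arg "hv_relaxed,hv_vapic,hv_time" "Nvidia43FIX"
```

The resulting string would replace the value after -cpu in the command above.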

I allow unsafe interrupts, otherwise I cannot start the VMs.
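For reference, a small sketch of how that setting can be checked on the host. It assumes the option is set through the vfio_iommu_type1 module parameter; unsafe_interrupts_state is a hypothetical helper name, and the optional path argument exists only so the function can be pointed at a different file.

```shell
#!/bin/sh
# Sketch: report whether vfio_iommu_type1 currently allows unsafe interrupts.
# Assumption: the option was set with a modprobe.d line such as
#   options vfio_iommu_type1 allow_unsafe_interrupts=1
# unsafe_interrupts_state is a hypothetical helper name.
unsafe_interrupts_state() {
    param="${1:-/sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts}"
    if [ -r "$param" ]; then
        # The kernel exposes this boolean parameter as Y or N.
        cat "$param"
    else
        # Module not loaded, or the parameter is not readable.
        echo "unknown"
    fi
}

unsafe_interrupts_state
```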

In the Windows guests I switched every device that supports MSI over to MSI.  Before doing that, the audio stuttered and lagged so badly that it was unintelligible, and crashes were much more frequent.  The audio still occasionally sputters under load.

There is another 660 on Bus 3, which is in use by the Host, though I doubt that makes a difference.  I set the driver-override when loading the vfio module, and do not so much as run X on the guest at the moment, though I do load the nouveau driver.  Maybe I should try assigning all the GPUs to the vfio module?
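To double-check which driver actually owns each passed-through function, something like the following sketch could be used. The four addresses are the ones from the kvm command above; bound_driver is a hypothetical helper, and the optional second argument (a sysfs root) is only there so it can be tried against a test directory.

```shell
#!/bin/sh
# Sketch: print the kernel driver currently bound to a PCI function.
# bound_driver is a hypothetical helper name.
#   $1 = PCI address in full form, e.g. 0000:04:00.0
#   $2 = sysfs root (defaults to /sys)
bound_driver() {
    dev="$1"
    sysfs="${2:-/sys}"
    link="$sysfs/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The driver symlink points at .../drivers/<name>.
        basename "$(readlink "$link")"
    else
        # No symlink means no driver is bound (or the device is absent).
        echo "none"
    fi
}

# The GPU and HDMI audio functions passed to the guest in this post.
for dev in 0000:04:00.0 0000:04:00.1 0000:05:00.0 0000:05:00.1; do
    printf '%s %s\n' "$dev" "$(bound_driver "$dev")"
done
```

If the override took effect, each of these should report vfio-pci rather than nouveau.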

Interestingly, a good way to cause one or both VMs to crash is to run a 3D-intensive workload on both at the same time, such as starting a video game or running a benchmark.  Stability drops precipitously in that case.




Thanks in advance for any pointers.  What follows is some of my logs:



# dmesg | grep -e IOMMU -e DMAR

Jul  7 02:15:16 ads-proxmox-1 kernel: [    0.000000] ACPI: DMAR 0x000000009F7980C0 000130 (v01 AMI    OEMDMAR  00000001 MSFT 00000097)
Jul  7 02:15:16 ads-proxmox-1 kernel: [    0.000000] DMAR: IOMMU enabled
Jul  7 02:15:16 ads-proxmox-1 kernel: [    0.142293] DMAR-IR: This system BIOS has enabled interrupt remapping
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.151114] DMAR: Host address width 39
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.151167] DMAR: DRHD base: 0x000000fbffe000 flags: 0x1
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.151237] DMAR: dmar0: reg_base_addr fbffe000 ver 1:0 cap c90780106f0462 ecap f020fe
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.151308] DMAR: RMRR base: 0x000000000ec000 end: 0x000000000effff
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.151364] DMAR: RMRR base: 0x0000009f7da000 end: 0x0000009f7d9fff
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.151420] DMAR: ATSR flags: 0x0
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.151682] DMAR: dmar0: Using Queued invalidation
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.151748] DMAR: Setting RMRR:
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.151820] DMAR: Setting identity map for device 0000:00:1a.0 [0x9f7da000 - 0x9f7d9fff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.154155] DMAR: Setting identity map for device 0000:00:1a.1 [0x9f7da000 - 0x9f7d9fff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.156471] DMAR: Setting identity map for device 0000:00:1a.2 [0x9f7da000 - 0x9f7d9fff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.164309] DMAR: Setting identity map for device 0000:00:1a.7 [0x9f7da000 - 0x9f7d9fff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.169324] DMAR: Setting identity map for device 0000:00:1d.0 [0x9f7da000 - 0x9f7d9fff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.171630] DMAR: Setting identity map for device 0000:00:1d.1 [0x9f7da000 - 0x9f7d9fff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.173934] DMAR: Setting identity map for device 0000:00:1d.2 [0x9f7da000 - 0x9f7d9fff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.176243] DMAR: Setting identity map for device 0000:00:1d.7 [0x9f7da000 - 0x9f7d9fff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.178546] DMAR: Setting identity map for device 0000:00:1a.0 [0xec000 - 0xeffff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.178645] DMAR: Setting identity map for device 0000:00:1a.1 [0xec000 - 0xeffff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.178742] DMAR: Setting identity map for device 0000:00:1a.2 [0xec000 - 0xeffff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.178838] DMAR: Setting identity map for device 0000:00:1a.7 [0xec000 - 0xeffff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.178933] DMAR: Setting identity map for device 0000:00:1d.0 [0xec000 - 0xeffff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.184352] DMAR: Setting identity map for device 0000:00:1d.1 [0xec000 - 0xeffff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.187147] DMAR: Setting identity map for device 0000:00:1d.2 [0xec000 - 0xeffff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.187243] DMAR: Setting identity map for device 0000:00:1d.7 [0xec000 - 0xeffff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.187332] DMAR: Prepare 0-16MiB unity mapping for LPC
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.187396] DMAR: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
Jul  7 02:15:16 ads-proxmox-1 kernel: [    1.187616] DMAR: Intel(R) Virtualization Technology for Directed I/O

...  there's also some of this:


Jul  7 02:23:03 ads-proxmox-1 kernel: [  481.694387] vfio_ecap_init: 0000:04:00.0 hiding ecap 0x19 at 0x900
Jul  7 02:23:03 ads-proxmox-1 kernel: [  481.730226] vfio_ecap_init: 0000:05:00.0 hiding ecap 0x19 at 0x900



