[vfio-users] TITAN X won't reset

vfio vfio at taintedbit.com
Mon Feb 22 08:25:18 UTC 2016


Hello everyone,

I have VGA passthrough working with a single GTX TITAN X card and a 
Debian (sid) host. However, the card stops working after the VM shuts 
down; the VM only works for the first attempt after booting the host. 
The issue persists across reboots of the host, as well. As far as I can 
tell, the most reliable way to bring the card back to life is to shut 
down the machine and cycle the power supply.

Moreover, the card also has a tendency to switch the PCI bus it is 
assigned to---it changes between 03:00 and 04:00---even though the 
hardware is not changing. I'm not sure if this is related.
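
Because the address moves between 03:00 and 04:00, a script can look the card up by vendor and class instead of hard-coding the bus. This is a minimal sketch, not something from my actual setup: the function name find_nvidia_vga is made up for illustration, and it assumes the numeric `lspci -D -n` line format (address, class, vendor:device), where 10de is NVIDIA's vendor ID and 0300 is the VGA controller class.

```shell
# Print the PCI address of every NVIDIA (vendor 10de) VGA-class (0300) device.
# Reads `lspci -D -n` style lines on stdin, for example:
#   0000:04:00.0 0300: 10de:xxxx (rev a1)
# find_nvidia_vga is a hypothetical helper name, not part of any tool.
find_nvidia_vga() {
    awk '$2 == "0300:" && $3 ~ /^10de:/ { print $1 }'
}

# Possible usage on a real host (not executed here):
#   gpu=$(lspci -D -n | find_nvidia_vga | head -n 1)
#   echo "GPU is currently at $gpu"
```

The printed address could then be passed to the host= option of vfio-pci in place of the fixed 04:00.0.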

I noticed a posting from "S B" in January about a similar problem with 
an AMD card, but it didn't seem to have a resolution.

When the card is dead, launching the VM causes this error:
   qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
   qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
   qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
   qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
   qemu-system-x86_64: vfio-pci: Cannot read device rom at 0000:04:00.0
   Device option ROM contents are probably invalid (check dmesg).
   Skip option ROM probe with rombar=0, or load from file with romfile=

dmesg shows this (annotated):
   # After switching from pci-stub to vfio:
   [  150.737259] vgaarb: device changed decodes: PCI:0000:04:00.0,olddecodes=io+mem,decodes=io+mem:owns=none
   # When starting the VM:
   [  198.572550] vfio-pci 0000:04:00.0: enabling device (0000 -> 0003)
   [  198.572607] vfio_ecap_init: 0000:04:00.0 hiding ecap 0x1e at 0x258
   [  198.572618] vfio_ecap_init: 0000:04:00.0 hiding ecap 0x19 at 0x900
   [  200.169203] vfio-pci 0000:04:00.0: Invalid ROM contents

Providing a romfile extracted from the card using GPU-Z in bare metal 
Windows does nothing except remove the romfile-related advice from the 
error message.

Here's an lspci immediately after booting. In this case, the card was 
pre-dead (i.e., even the first VM launch failed):
   root@debian:~# lspci -s 04:00.0 -v
   04:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1) (prog-if 00 [VGA controller])
     Subsystem: eVga.com. Corp. GM200 [GeForce GTX TITAN X]
     Flags: fast devsel, IRQ 65
     Memory at f6000000 (32-bit, non-prefetchable) [disabled] [size=16M]
     Memory at 90000000 (64-bit, prefetchable) [disabled] [size=256M]
     Memory at a0000000 (64-bit, prefetchable) [disabled] [size=32M]
     I/O ports at c000 [disabled] [size=128]
     Expansion ROM at f7000000 [disabled] [size=512K]
     Capabilities: [60] Power Management version 3
     Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
     Capabilities: [78] Express Legacy Endpoint, MSI 00
     Capabilities: [100] Virtual Channel
     Capabilities: [250] Latency Tolerance Reporting
     Capabilities: [258] L1 PM Substates
     Capabilities: [128] Power Budgeting <?>
     Capabilities: [420] Advanced Error Reporting
     Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
     Capabilities: [900] #19
     Kernel driver in use: vfio-pci
     Kernel modules: nouveau

(If I run lspci with -vvv, it shows the card in the D0 power state)
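
The power state can also be pulled out of the -vvv output with a small filter. This is a rough sketch that assumes the usual "Status: D0 ..." line that lspci prints under the Power Management capability; pci_power_state is a hypothetical helper name. When the config space is unreadable, lspci lists no capabilities at all, so the function prints nothing.

```shell
# Extract the PCI power state (D0..D3) from `lspci -vvv` text on stdin.
# Matches the "Status: D<n> ..." line under the Power Management capability;
# the device-level "Status: Cap+ ..." line does not match the pattern.
# pci_power_state is a hypothetical helper name.
pci_power_state() {
    sed -n 's/^[[:space:]]*Status: \(D[0-3]\).*/\1/p' | head -n 1
}

# Possible usage on a real host (not executed here):
#   lspci -s 04:00.0 -vvv | pci_power_state
```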

Here's lspci after the first VM attempt failed:
   root@debian:~# lspci -s 04:00.0 -vvv
   04:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev ff) (prog-if ff)
     !!! Unknown header type 7f
     Kernel driver in use: vfio-pci
     Kernel modules: nouveau

Here's my qemu command:
   sudo qemu-system-x86_64 \
     -enable-kvm \
     -m 4096 \
     -cpu host,kvm=off \
     -smp 4,sockets=1,cores=4,threads=1 \
     -drive if=pflash,format=raw,readonly,file=/usr/share/OVMF/OVMF_CODE.fd \
     -drive if=pflash,format=raw,file=/usr/share/OVMF/OVMF_VARS.fd \
     -drive file=/dev/sdf,format=raw \
     -soundhw hda \
     -usb \
     -device usb-host,hostbus=10,hostport=1.7.3 \
     -device usb-host,hostbus=10,hostport=1.7.4 \
     -device vfio-pci,host=04:00.0 \
     -device vfio-pci,host=04:00.1 \
     -vga none

Does anybody know why this might be happening?

Thanks!



