[vfio-users] Need help with GPU Passthrough on Ryzen C6H + GTX 980 Ti + GTX 1060 6G

Alex Williamson alex.l.williamson at gmail.com
Thu Jul 6 05:20:01 UTC 2017


On Wed, Jul 5, 2017 at 10:23 PM, Thiago Ramon <thiagoramon at gmail.com> wrote:
>
>
> Here, dropped the raw message in pastebin: https://pastebin.com/hfJ6ryJg
>
> That particular run was trying to pass the 980 Ti, which is the boot
> device, and which probably had something else prodding at it (I'll give it
> a try again and check what else was attaching to it). I've mostly focused
> on passing the 1060 though, which doesn't get touched by anything but
> vfio-pci, and also doesn't show any mmap issues, here's the last QEMU run
> with SeaBIOS:
>
> https://pastebin.com/DEPpewCH
>
> And the last one from OVMF:
>
> https://pastebin.com/L7gkrm36
>
> In the kernel log, I only get the vfio_bar_restore messages. One
> interesting and consistent pattern is that SeaBIOS always generates 2 pairs
> of warnings (one for GPU, one for audio), while OVMF generates quite a few
> (a dozen+, don't have a log handy). Probably not relevant, as apparently the
> failure happens before the first message anyway.
>
> Another detail that may be relevant: Whenever I try a passthrough (and
> fail), the kernel fails to soft restart. It gets to the last stage where it
> would do a soft reset but the console just sits there. Could this just be
> vfio_pci trying to do something with the unresponsive card, or something
> else that may be a clue to what's going on?
>

Yep, here's what I suspected about the D3 warning:

> PCI state after passthrough attempt:
> 29:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] [10de:17c8] (rev ff) (prog-if ff)
>   !!! Unknown header type 7f
>   Kernel driver in use: vfio-pci
>   Kernel modules: nouveau, nvidia_drm, nvidia
>
> 29:00.1 Audio device [0403]: NVIDIA Corporation GM200 High Definition Audio [10de:0fb0] (rev ff) (prog-if ff)
>   !!! Unknown header type 7f
>   Kernel driver in use: vfio-pci
>   Kernel modules: snd_hda_intel
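
The 'rev ff', 'prog-if ff' and 'header type 7f' fields in that listing are what
every config byte reading back as 0xff looks like once decoded.  A minimal
sketch of the decoding, assuming the standard layouts (power state in bits 1:0
of the PM control/status register, multi-function flag in bit 7 of the header
type byte); the helper names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helpers; arguments are hex strings without a 0x prefix.

# Decode the power state from a 16-bit PM control/status value (bits 1:0).
pm_state() {
    case $(( 0x$1 & 0x3 )) in
        0) echo D0 ;;
        1) echo D1 ;;
        2) echo D2 ;;
        3) echo D3hot ;;
    esac
}

# Header layout as lspci reports it: the header type byte with bit 7
# (the multi-function flag) masked off.
header_layout() {
    printf '%02x\n' $(( 0x$1 & 0x7f ))
}

pm_state ffff       # an all-ones read decodes as D3hot
header_layout ff    # an all-ones read decodes as layout 7f
```

So an all-ones read lands on the same power state bits as a device that really
is in D3hot.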

The card isn't actually stuck in D3; it has basically disappeared from the
bus, and all reads from config space are returning -1 (all bits set), which
is indistinguishable from the D3 power state for the bits that tell us the
power state.  This is probably the result of doing a bus reset, but that's
also our only way of putting the device back into a known state before
starting it in the VM.  You might try to see if you can reproduce this
result manually with setpci.  We do a bus reset through the bridge upstream
of the device; lspci -t is handy for finding it, since it gives a tree view
of the PCI topology.  As an example:

https://pastebin.com/c3URT6vx

Bus numbers are shown in brackets, so to find the parent bridge of device
01:00.0, look to the left of [01]--00.0, where you find 01.0.  This is
attached to the root bus at [0000:00], so the full address of the parent
bridge is 0000:00:01.0.

We can access the bridge control register using

# setpci -s 0000:00:01.0 BRIDGE_CONTROL

The secondary bus reset bit is 0x40.  We want to set this bit:

# setpci -s 0000:00:01.0 BRIDGE_CONTROL=40:40

Then clear it:

# setpci -s 0000:00:01.0 BRIDGE_CONTROL=00:40
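
The value:mask form in the two writes above is a read-modify-write: only the
bits set in the mask are changed, so the other bridge control bits keep their
current values.  A minimal sketch of that arithmetic, using a hypothetical
apply_mask helper and a made-up starting value of 0003:

```shell
#!/bin/sh
# Hypothetical helper: apply_mask OLD VALUE MASK
# All arguments are hex strings without a 0x prefix.  Prints the value the
# register would hold after a masked write, as four hex digits.
apply_mask() {
    old=$(( 0x$1 ))
    val=$(( 0x$2 ))
    mask=$(( 0x$3 ))
    # Keep the bits outside the mask, take the bits inside it from VALUE.
    printf '%04x\n' $(( (old & ~mask) | (val & mask) ))
}

apply_mask 0003 40 40   # set the secondary bus reset bit
apply_mask 0043 00 40   # clear it again, other bits untouched
```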

Then run lspci on the bus to see if the device is still present.  In your
case it would be bus 29, so you'd run

# lspci -vvv -s 0000:29:

Do you get output like the above, with 'Unknown header type 7f', or a
complete listing of the device?  Be sure to reboot the system after running
this test, regardless of the result, so that the device is re-initialized.
Clearly, nothing should be using the device while you do this.  If the
graphics card doesn't recover from a bus reset, then something about this
system setup is not compatible with this use case.  Thanks,

Alex