[vfio-users] Need help with GPU Passthrough on Ryzen C6H + GTX 980 Ti + GTX 1060 6G

Thu Jul 6 16:46:08 UTC 2017

On Thu, Jul 6, 2017 at 2:20 AM, Alex Williamson <alex.l.williamson at gmail.com
> wrote:

> On Wed, Jul 5, 2017 at 10:23 PM, Thiago Ramon <thiagoramon at gmail.com>
> wrote:
>>
>>
>> Here, dropped the raw message in pastebin: https://pastebin.com/hfJ6ryJg
>>
>> That particular run was trying to pass the 980 Ti, which is the boot
>> device, and which probably had something else prodding at it (I'll give it
>> a try again and check what else was attaching to it). I've mostly focused
>> on passing the 1060 though, which doesn't get touched by anything but
>> vfio-pci, and also doesn't show any mmap issues, here's the last QEMU run
>> with SeaBIOS:
>>
>> https://pastebin.com/DEPpewCH
>>
>> And the last one from OVMF:
>>
>> https://pastebin.com/L7gkrm36
>>
>> On the kernel log, I only get the vfio_bar_restore messages. One
>> interesting and consistent pattern is that SeaBIOS always generates 2 pairs
>> of warnings (one for GPU, one audio), while OVMF generates quite a bit
>> (dozen+, don't have a log handy). Probably not relevant, as apparently the
>> failure happens before the first message anyway.
>>
>> Another detail that may be relevant: Whenever I try a passthrough (and
>> fail), the kernel fails to soft restart. It gets to the last stage where it
>> would do a soft reset but the console just sits there. Could this just be
>> vfio_pci trying to do something with the unresponsive card, or something
>> else that may be a clue to what's going on?
>>
>
> Yep, here's what I suspected about the D3 warning:
>
> >PCI state after passthrough attempt:
> > 29:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200
> [GeForce GTX 980 Ti] [10de:17c8] (rev ff) (prog-if ff)
> >   !!! Unknown header type 7f
> >   Kernel driver in use: vfio-pci
> >   Kernel modules: nouveau, nvidia_drm, nvidia
> >
> > 29:00.1 Audio device [0403]: NVIDIA Corporation GM200 High Definition
> Audio [10de:0fb0] (rev ff) (prog-if ff)
> >   !!! Unknown header type 7f
> >   Kernel driver in use: vfio-pci
> >   Kernel modules: snd_hda_intel
>
> The card isn't actually stuck in D3, it's basically disappeared from the
> bus and all reads from config space are returning -1, which is
> indistinguishable from the D3 power state for the bits that tell us the
> power state.  This is probably the result of doing a bus reset, but that's
> also our only way of putting the device back to a known state before
> starting it in the VM.  You might try to see if you can reproduce this
> result manually with setpci.  We do a bus reset by finding the bridge
> upstream of the device, lspci -t is handy for this with a tree view of the
> PCI topology.  As an example:
>
> https://pastebin.com/c3URT6vx
>
> Bus numbers are shown in brackets, so if I want the parent bridge of
> device 01:00.0, look to the left of [01]--00.0 to find 01.0.  This is
> attached to the root bus at [0000:00], so the full address of the parent
> bridge is 0000:00:01.0.
>
> We can access the bridge control register using
>
> # setpci -s 0000:00:01.0 BRIDGE_CONTROL
>
> The secondary bus reset bit is 0x40.  We want to set this bit:
>
> # setpci -s 0000:00:01.0 BRIDGE_CONTROL=40:40
>
> Then clear it:
>
> # setpci -s 0000:00:01.0 BRIDGE_CONTROL=00:40
>
> Then run lspci on the bus to see if the device is still present.  In your
> case it would be bus 29, so you'd run
>
> # lspci -vvv -s 0000:29:
>
> Do you get output like above with the 'Unknown header type 7f' or a
> complete listing of the device?  Be sure to reboot the system after running
> this test so that the device is re-initialized, regardless of the result, and
> clearly nothing should be using the device while doing this.  If the
> graphics card doesn't recover from a bus reset, then something about this
> system setup is not compatible with this use case.  Thanks,
>
> Alex
>
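The parent bridge lookup described in the quoted message can also be done through sysfs instead of reading the lspci -t tree by eye. The following is only a minimal sketch: the function name parent_bridge and the SYSFS_PCI override are hypothetical names introduced here, and it assumes a Linux system with readlink -f available.

```shell
#!/bin/sh
# Sketch: find the bridge directly upstream of a PCI device via sysfs.
# SYSFS_PCI is a hypothetical override (handy for testing); the default
# is the standard Linux location.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# parent_bridge ADDRESS   (full address, e.g. 0000:01:00.0)
# Each entry under /sys/bus/pci/devices is a symlink into the device
# tree, so the parent directory of the resolved path is the upstream
# bridge. For a device sitting directly on the root bus this prints the
# root complex name (e.g. pci0000:00) rather than a bridge address.
parent_bridge() {
    [ -e "$SYSFS_PCI/$1" ] || { echo "no such device: $1" >&2; return 1; }
    dev_path=$(readlink -f "$SYSFS_PCI/$1") || return 1
    basename "$(dirname "$dev_path")"
}
```

With the example topology from the quoted message, parent_bridge 0000:01:00.0 should print 0000:00:01.0, which is the address then passed to setpci -s.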

OK, I did some more testing. First, starting with both cards bound to the
NVIDIA driver, I shut down X, ran rmmod nvidia, bound my secondary card to
vfio-pci, and tried to reset the bus. It failed to reset properly and got
stuck.
Then I switched to my primary-card passthrough setup to see what was
grabbing the card's memory. It turned out to be vesafb, even though I had
disabled it.
After adding a bunch more options to the boot command line, I managed to
keep everything else off the card, and proceeded to test the bus reset,
which worked this time.
Then I tried running the VM without an external BIOS; it failed, but with a
complaint about not being able to access the BIOS.
I rebooted again and tried with a pre-dumped BIOS, and it still failed in
the same way as before.

Returning to my secondary card, I tried resetting the bus again, this time
from a fresh boot, and it seems to have worked fine. Here are the logs:

https://pastebin.com/94F5wURY

I then reset the bus a few more times to see whether repeated resets were a
problem, but at least half a dozen resets didn't cause any issues.
Any other ideas?
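
For anyone wanting to repeat this kind of test, the reset-and-check cycle can be scripted. This is a rough sketch, not the exact procedure used above: the function names and the one-second delays are assumptions standing in for the time it takes to type the commands by hand, and it must be run as root with pciutils installed and nothing using the device.

```shell
#!/bin/sh
# Sketch: repeat the secondary bus reset and check the device each time.
# Function names and sleep durations are illustrative assumptions.

# Succeeds if the lspci text on stdin shows a device that has dropped
# off the bus (lspci reports "!!! Unknown header type 7f" in that case).
device_gone() {
    grep -q 'Unknown header type'
}

# repeat_bus_reset BRIDGE BUS COUNT
# e.g. repeat_bus_reset 0000:00:01.0 0000:01: 6
repeat_bus_reset() {
    bridge=$1
    bus=$2
    count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        setpci -s "$bridge" BRIDGE_CONTROL=40:40   # set the reset bit
        sleep 1
        setpci -s "$bridge" BRIDGE_CONTROL=00:40   # clear the reset bit
        sleep 1
        if lspci -vvv -s "$bus" | device_gone; then
            echo "reset $i: device did not come back"
            return 1
        fi
        echo "reset $i: device still present"
        i=$((i + 1))
    done
}
```

The bridge and bus arguments in the comment are the ones from the example topology earlier in the thread; substitute the values for the system under test, and reboot afterwards as advised.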