[vfio-users] Bus reset trouble with Titan-X

Kevin Vasko kvasko at gmail.com
Tue Oct 18 16:04:14 UTC 2016


Alex,

(crossing fingers this goes into the correct thread).

I upgraded this machine to 4.4.0-42-generic.

I spawned a single VM with 1 GPU immediately after the kernel upgrade, and
it worked: the GPU attached properly, and lspci in the VM showed it
correctly.

I then deleted that VM and started one with 4 GPUs, and the same issue came
back. Three of the GPUs attached properly; the fourth did not.

So it appears that upgrading the kernel did not resolve the problem. If you
don't mind providing the instructions for resetting the bus (what you were
talking about yesterday) so I can narrow this down further, that would be
appreciated. Any other suggestions would be greatly appreciated as well.

Here are the logs of the 4 GPU attachment that failed.

On the host.

/etc/var/log/libvirt/qemu/instance-00000185.log

This shows the /usr/bin/kvm command line attaching the following devices:

-device vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5
-device vfio-pci,host=10:00.0,id=hostdev1,bus=pci.0,addr=0x6
-device vfio-pci,host=0e:00.0,id=hostdev2,bus=pci.0,addr=0x7
-device vfio-pci,host=0d:00.0,id=hostdev3,bus=pci.0,addr=0x8
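
To line these arguments up with the guest output further down, here is a
minimal Python sketch that maps each host address to the guest address
implied by addr=. It assumes bus=pci.0 is guest bus 00 and that each device
lands on function 0; the helper name host_to_guest is made up for
illustration.

```python
import re

# Matches the host= and addr= fields of one vfio-pci -device argument.
DEVICE_RE = re.compile(r"host=([0-9a-fA-F:.]+).*?addr=0x([0-9a-fA-F]+)")


def host_to_guest(cmdline):
    """Return {host BDF: guest BDF} for every vfio-pci device on the command line.

    Assumption: bus=pci.0 is guest bus 00 and the device sits on function 0.
    """
    mapping = {}
    for arg in cmdline.split("-device")[1:]:
        if "vfio-pci" not in arg:
            continue
        match = DEVICE_RE.search(arg)
        if match:
            slot = int(match.group(2), 16)
            mapping[match.group(1)] = "00:%02x.0" % slot
    return mapping


ARGS = """
-device vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5
-device vfio-pci,host=10:00.0,id=hostdev1,bus=pci.0,addr=0x6
-device vfio-pci,host=0e:00.0,id=hostdev2,bus=pci.0,addr=0x7
-device vfio-pci,host=0d:00.0,id=hostdev3,bus=pci.0,addr=0x8
"""

if __name__ == "__main__":
    for host, guest in host_to_guest(ARGS).items():
        print("host %s -> guest %s" % (host, guest))
```

Under those assumptions, host 0f:00.0 corresponds to guest 00:05.0.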


lspci -vnnn -d 10de:17c2 (on the host, I omitted the other 4 GPUs)


0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
     subsystem: NVIDIA Corporation Device [10de:1132]
     Flags: fast devsel, IRQ 28
     Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
     Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
     Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
     I/O ports at 3000 [size=128]
     Expansion ROM at ba000000 [disabled] [size=512k]
     Capabilities: [60] Power Management version 3
     Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
     Capabilities: [78] Express Legacy Endpoint, MSI 00
     Capabilities: [100] Express Legacy Endpoint, MSI 00
     Capabilities: [250] Latency Tolerance Reporting
     Capabilities: [258] L1 PM Substates
     Capabilities: [128] Power Budgeting <?>
     Capabilities: [420] Advanced Error Reporting
     Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
     Capabilities: [900] #19
     Kernel driver in use: vfio-pci

0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
     subsystem: NVIDIA Corporation Device [10de:1132]
     Flags: fast devsel, IRQ 28
     Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
     Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
     Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
     I/O ports at 3000 [size=128]
     Expansion ROM at ba000000 [disabled] [size=512k]
     Capabilities: [60] Power Management version 3
     Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
     Capabilities: [78] Express Legacy Endpoint, MSI 00
     Capabilities: [100] Express Legacy Endpoint, MSI 00
     Capabilities: [250] Latency Tolerance Reporting
     Capabilities: [258] L1 PM Substates
     Capabilities: [128] Power Budgeting <?>
     Capabilities: [420] Advanced Error Reporting
     Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
     Capabilities: [900] #19
     Kernel driver in use: vfio-pci


0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
     !!! Unknown header type 7f
     Kernel driver in use: vfio-pci


10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
     subsystem: NVIDIA Corporation Device [10de:1132]
     Flags: fast devsel, IRQ 28
     Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
     Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
     Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
     I/O ports at 3000 [size=128]
     Expansion ROM at ba000000 [disabled] [size=512k]
     Capabilities: [60] Power Management version 3
     Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
     Capabilities: [78] Express Legacy Endpoint, MSI 00
     Capabilities: [100] Express Legacy Endpoint, MSI 00
     Capabilities: [250] Latency Tolerance Reporting
     Capabilities: [258] L1 PM Substates
     Capabilities: [128] Power Budgeting <?>
     Capabilities: [420] Advanced Error Reporting
     Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
     Capabilities: [900] #19
     Kernel driver in use: vfio-pci
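
A quick way to check on the host which functions are in this state,
without reading through lspci output, is to read each device's config
space through sysfs and see whether it comes back as all 0xff bytes. The
following is only a sketch: it assumes PCI domain 0000 and the standard
/sys/bus/pci/devices/<address>/config file, and the function names are
made up for illustration.

```python
from pathlib import Path


def config_is_all_ones(cfg):
    """True if every byte read from config space is 0xff (device not answering)."""
    return len(cfg) > 0 and all(b == 0xFF for b in cfg)


def read_config(bdf, length=64):
    """Read the first `length` bytes of config space for a bus:dev.fn address.

    Assumption: the device lives in PCI domain 0000.
    """
    path = Path("/sys/bus/pci/devices") / ("0000:" + bdf) / "config"
    with path.open("rb") as f:
        return f.read(length)


def report(bdfs):
    """Print one status line per device; devices missing from sysfs are noted."""
    for bdf in bdfs:
        try:
            cfg = read_config(bdf)
        except OSError as err:
            print("%s: cannot read config space (%s)" % (bdf, err))
            continue
        state = "all ones, not responding" if config_is_all_ones(cfg) else "responding"
        print("%s: %s" % (bdf, state))


if __name__ == "__main__":
    report(["0d:00.0", "0e:00.0", "0f:00.0", "10:00.0"])
```

Running this before and after each VM start or reset attempt would show
which device drops out and whether it ever comes back.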


On the VM guest:


lspci


00:06.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
00:07.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
00:08.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)

dmesg


[    0.787786] pci 0000:00:05.0: [10de:17c2] type 7f class 0xffffff
[    0.788970] pci 0000:00:06.0: [10de:17c2] type 00 class 0x030000
[    0.855192] pci 0000:00:07.0: [10de:17c2] type 00 class 0x030000
[    0.925003] pci 0000:00:08.0: [10de:17c2] type 00 class 0x030000
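
The "type 7f class 0xffffff" line in the guest and the "rev ff", "prog-if
ff" and "Unknown header type 7f" on the host are all what you get when the
standard config header is decoded from bytes that read back as 0xff. A
minimal Python sketch of that decoding follows; the offsets are the
standard PCI header layout, while the function name and the hand-built
"healthy" header are only for illustration.

```python
import struct


def decode_header(cfg):
    """Decode a few fields of the standard PCI config header from raw bytes."""
    vendor, device = struct.unpack_from("<HH", cfg, 0x00)
    header_byte = cfg[0x0E]
    return {
        "vendor": vendor,
        "device": device,
        "revision": cfg[0x08],
        "prog_if": cfg[0x09],
        # 24-bit class: prog-if (0x09), subclass (0x0a), base class (0x0b).
        "class_code": int.from_bytes(cfg[0x09:0x0C], "little"),
        # Bit 7 is the multi-function flag; the low 7 bits are the layout type.
        "header_type": header_byte & 0x7F,
        "multifunction": bool(header_byte & 0x80),
    }


# A device that does not answer: every read returns all ones.
dead = decode_header(b"\xff" * 64)

# Hand-built header resembling a working card (illustrative values only).
healthy_bytes = bytearray(64)
struct.pack_into("<HH", healthy_bytes, 0x00, 0x10DE, 0x17C2)
healthy_bytes[0x08] = 0xA1          # revision
healthy_bytes[0x0B] = 0x03          # base class: display controller
healthy = decode_header(bytes(healthy_bytes))

if __name__ == "__main__":
    print("dead:    type %02x class 0x%06x rev %02x"
          % (dead["header_type"], dead["class_code"], dead["revision"]))
    print("healthy: type %02x class 0x%06x rev %02x"
          % (healthy["header_type"], healthy["class_code"], healthy["revision"]))
```

The 0xff header-type byte has its top bit masked off as the multi-function
flag, which is why the reported type is 7f rather than ff.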




On Mon, Oct 17, 2016 at 11:10 PM, Kevin Vasko <kvasko at gmail.com> wrote:

> Thanks. I'm an idiot. I just replied to the email directly after the
> subscription and wasn't paying attention. Thank you for correcting it.
>
> I was originally running 3.13.0-86-generic and upgraded to the 3.19
> version to try it before I posted this, but got the same results. I'll
> try a newer version of the kernel and see what happens.
>
> Sorry to be dense but what do you mean by "retrain properly"? I assume you
> mean that once it fails to reset it just never recovers?
>
> We have 2 other machines that I've never seen this problem with, so what
> you are saying makes sense. This system does have a slightly more
> specialized PCI bus to be able to stick 8 cards on a single bus (at least
> that is my understanding), so at this point, either I'm hitting a bug that
> is fixed in the kernel, or this PCI bus is not doing something that
> vfio-pci is expecting (would be my speculation).
>
> I'll report back my findings tomorrow.
>
> Thanks for the help.
>
> -Kevin
>
> On Mon, Oct 17, 2016 at 5:53 PM, Alex Williamson <
> alex.williamson at redhat.com> wrote:
>
>> (generally a good idea to have a useful subject line)
>>
>> On Mon, 17 Oct 2016 16:26:15 -0500
>> Kevin Vasko <kvasko at gmail.com> wrote:
>> >
>> > Any suggestions on debugging a !!! Unknown header type 7f?
>> >
>>
>> This usually means that the device didn't come back from bus reset and
>> re-reading the PCI config space where the device was just gives a -1
>> response.  lspci tries to interpret that bogus data and gives results
>> like you see.  You might try a newer kernel, we've probably fixed some
>> things in the bus reset path since v3.19.  It looks like you continue
>> to see the bogus data once it gets into this state, so it's probably
>> not a "simple" device coming out of reset too slowly problem.  Possibly
>> the PCIe link doesn't retrain properly sometimes after a bus reset.  If
>> a new kernel doesn't help, I could give you instructions for performing
>> a bus reset with setpci and you could test how reliably you can reset
>> the device and read config space after.  Thanks,
>>
>> Alex
>>
>
>