[vfio-users] Bus reset trouble with Titan-X

Alex Williamson alex.williamson at redhat.com
Tue Oct 18 23:03:06 UTC 2016


On Tue, 18 Oct 2016 17:48:59 -0500
Kevin Vasko <kvasko at gmail.com> wrote:

> Alex,
> 
> I think I ran it correctly and was able to make the device fail. It went
> from (rev a1) to (rev ff), along with the header error.
> 
> Instead of doing all devices, I did one at a time.
> 
> This was the output of
> 
> # lspci -tv
> 
> +-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+--00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                             |                 \-00.1  NVIDIA Corporation Device efb0
>                                             +-04.0-[05]--+--00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                             |                 \-00.1  NVIDIA Corporation Device efb0
>                                             +-08.0-[06]--+--00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                             |                 \-00.1  NVIDIA Corporation Device efb0
>                                             +-0c.0-[07]--+--00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                             |                 \-00.1  NVIDIA Corporation Device efb0
>                                             +-14.0-[08]----00.0   Mellanox Technologies MT27600 Family [ConnectX-3]
> +-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+--00.0-[0d]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>           |                  \-00.1  NVIDIA Corporation Device 0fb0
>           +--04.0-[0e]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>           |                  \-00.1  NVIDIA Corporation Device 0fb0
>           +--08.0-[0f]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>           |                  \-00.1  NVIDIA Corporation Device 0fb0
>           +--0c.0-[10]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>           |                  \-00.1  NVIDIA Corporation Device 0fb0
> 
> I tried the first device
> # virsh nodedev-detach --driver=kvm pci_0000_04_00_0
> Device pci_0000_04_00_0 detached
> 
> # virsh nodedev-detach --driver=kvm pci_0000_04_00_1
> Device pci_0000_04_00_1 detached
> 
> In the script I put
> 
> DEVS=(
>             03:00.0
>             04
> )
> 
> Ran it 100 times and got no error.
> 
> Then I ran it for a different device, 05
> 
> 
> 
> # virsh nodedev-detach --driver=kvm pci_0000_05_00_0
> Device pci_0000_05_00_0 detached
> 
> # virsh nodedev-detach --driver=kvm pci_0000_05_00_1
> Device pci_0000_05_00_1 detached
> 
> DEVS=(
>             03:04.0
>             05:
> )
> 
> 
> I saw this.
> 
> #: for i in $(seq 1 100); do ./reset.sh; done
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev ff)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff)
> 
> I repeated this with another device on the system.
> 
> I assume this indicates that the device is not resetting properly? The
> question is where do I go from here? Would this indicate a problem with the
> PCI reset code or problematic hardware?

Right, the PCIe link is not coming back for some reason, which seems
like a hardware issue.  Can you attach the output of 'sudo lspci -vvvs
3:04.0' when you're in this state (replace with the appropriate parent
bridge depending on the failed device)?  Maybe we can see whether that
downstream port is stuck in training.
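
As a rough sketch of the check described above, the helpers below find
the bridge directly above a failed device and pull the link training
flags out of the lspci output.  The function names are hypothetical and
not part of any script in this thread.  The sketch assumes the usual
sysfs layout, where a device's resolved path nests under its parent
bridge, and the usual 'LnkSta:' line in 'lspci -vvv' output with its
Train and DLActive flags.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not from this thread.

# Print the PCI address of the bridge directly above a device, given the
# device's resolved sysfs path, i.e. the output of
#   readlink -f /sys/bus/pci/devices/<address>
parent_bridge() {
    local path=$1
    basename "$(dirname "$path")"
}

# Read 'lspci -vvv' output on stdin and print the Train and DLActive
# flags from the first LnkSta line.  Train+ means the link is still
# training; DLActive- means the data link layer is not up.
link_flags() {
    grep -m1 'LnkSta:' | grep -oE '(Train|DLActive)[+-]' | paste -sd' ' -
}

# Example use on the failed device from this thread (needs root):
#   dev=0000:05:00.0
#   bridge=$(parent_bridge "$(readlink -f "/sys/bus/pci/devices/$dev")")
#   sudo lspci -vvv -s "$bridge" | link_flags
```

The full 'lspci -vvv' output for the bridge is still the more useful
thing to attach; the flags are only a quick first look.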

What I would do next is to test each card repeatedly.  Do only some
cards fail?  If so, swap a working card and a non-working card.  Does
the failure follow the card or the slot?  I'm not sure what the result
is going to be, but if we can't rely on a PCI bus reset then you're
really not going to have any repeatability with assigning the GPUs.
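
One way to make the per-card runs comparable is to count how many
resets complete before the device drops to (rev ff).  This is a minimal
sketch with hypothetical function names; it assumes reset.sh prints the
lspci line for each device after every reset, as in the output quoted
above.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not from this thread.

# Succeed if any lspci line on stdin reports revision ff, which is how
# the failed devices show up in the output quoted above.
has_rev_ff() {
    grep -q '(rev ff)'
}

# Run a reset command up to N times, stop at the first failure, and
# print the number of iterations that completed cleanly.
# Usage: resets_until_failure N command [args...]
resets_until_failure() {
    local max=$1 i
    shift
    for ((i = 0; i < max; i++)); do
        if "$@" | has_rev_ff; then
            break
        fi
    done
    echo "$i"
}

# Example: edit DEVS in reset.sh for one card, then run
#   resets_until_failure 100 ./reset.sh
# and note the count against both the card and the slot it sits in.
```

Stopping at the first failure leaves the device in the failed state, so
the lspci output for the parent bridge can be captured right away.
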
Thanks,

Alex



