[vfio-users] Bus reset trouble with Titan-X
Alex Williamson
alex.williamson at redhat.com
Tue Oct 18 23:03:06 UTC 2016
On Tue, 18 Oct 2016 17:48:59 -0500
Kevin Vasko <kvasko at gmail.com> wrote:
> Alex,
>
> I think I was able to do it successfully and was successfully able to make
> the thing fail. It went from (rev a1) to (rev ff) with response of the
> header error.
>
> Instead of doing all devices I just did 1 at a time.
>
> this was the output of
>
> # lspci -tv
>
> +-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                 |            \-00.1 NVIDIA Corporation Device efb0
>                                 +-04.0-[05]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                 |            \-00.1 NVIDIA Corporation Device efb0
>                                 +-08.0-[06]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                 |            \-00.1 NVIDIA Corporation Device efb0
>                                 +-0c.0-[07]--+--00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                 |            \-00.1 NVIDIA Corporation Device efb0
>                                 +-14.0-[08]----00.0 Mellanox Technologies MT27600 Family [ConnectX-3]
> +-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+--00.0-[0d]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                                                 |             \-00.1 NVIDIA Corporation Device 0fb0
>                                                                 +--04.0-[0e]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                                                 |             \-00.1 NVIDIA Corporation Device 0fb0
>                                                                 +--08.0-[0f]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                                                 |             \-00.1 NVIDIA Corporation Device 0fb0
>                                                                 +--0c.0-[10]--+-00.0 NVIDIA Corporation GM200 [GeForce GTX TITAN X]
>                                                                 |             \-00.1 NVIDIA Corporation Device 0fb0
>
> I tried the first device
> # virsh nodedev-detach --driver=kvm pci_0000_04_00_0
> Device pci_0000_04_00_0 detached
>
> # virsh nodedev-detach --driver=kvm pci_0000_04_00_1
> Device pci_0000_04_00_1 detached
>
> In the script I put
>
> DEVS=(
> 03:00.0
> 04
> )
>
> Ran it 100 times and got no error.
>
> Ran it for a different device 05
>
>
>
> # virsh nodedev-detach --driver=kvm pci_0000_05_00_0
> Device pci_0000_05_00_0 detached
>
> # virsh nodedev-detach --driver=kvm pci_0000_05_00_1
> Device pci_0000_05_00_1 detached
>
> DEVS=(
> 03:04.0
> 05:
> )
>
>
> I saw this.
>
> #: for i in $(seq 1 100); do ./reset.sh; done
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
> TITAN X] (rev a1)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
> TITAN X] (rev a1)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
> TITAN X] (rev ff)
> 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff)
>
> I repeated this with another device on the system.
>
> I assume this indicates that the device is not resetting properly? The
> question is where do I go from here? Would this indicate a problem with the
> PCI reset code or problematic hardware?
Right, the PCIe link is not coming back for some reason, which seems
like a hardware issue. Can you attach the output of 'sudo lspci -vvvs
3:04.0' when you're in this state (substitute the appropriate parent
bridge for the failed device)? Maybe we can see whether that
downstream port is stuck in training.
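
As a rough sketch of what to look for in that output: the link status
flags printed by lspci -vvv normally include Train+/- and DLActive+/-.
The helper below is hypothetical (the name link_state and its output
words are my own, not part of any tool) and assumes those flags appear
in the text as lspci usually prints them. It only classifies text fed
on stdin and does not touch the hardware.

```shell
# Classify a bridge port from `lspci -vvv` text read on stdin.
# Assumption: the link status flags appear as "Train+"/"Train-" and
# "DLActive+"/"DLActive-", which is how lspci usually prints them.
link_state() {
    text=$(cat)
    case "$text" in
        *Train+*)    echo training ;;  # link training still in progress
        *DLActive-*) echo down ;;      # data link layer reports not active
        *DLActive+*) echo up ;;        # data link layer reports active
        *)           echo unknown ;;   # flags not found in the input
    esac
}

# Example (root is needed for the full capability dump):
#   sudo lspci -vvvs 3:04.0 | link_state
```

A result of "training" or "down" on the parent bridge after a failed
reset would fit the theory that the link never came back. "unknown"
just means the flags were not in the text, for example when lspci was
run without root.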
What I would do next is to test each card repeatedly. Do only some
cards fail? If so, swap a working card and a non-working card: does
the failure follow the card or the slot? I'm not sure what the result
is going to be, but if we can't rely on a PCI bus reset then you're
really not going to have any repeatability when assigning the GPUs.
Thanks,
Alex