[vfio-users] Bus reset trouble with Titan-X

Kevin Vasko kvasko at gmail.com
Wed Oct 19 15:00:57 UTC 2016


Sure thing. I'm including all of the logs I have below so that you (and
anyone who runs into a similar issue) can get the bigger picture. Hopefully
I didn't mess anything up.

Unfortunately, I've now seen almost every single device fail at one point
or another. I thought it might be isolated to a single PLX riser card, but
I have since seen devices fail behind every single parent bridge at one time
or another. Based on that, I don't think I can narrow it down to a single
PCI slot or PLX riser as the culprit. Unless both of these boards are bad,
my conclusion is also that this indicates a problem with the hardware. I
completely agree that if the PCI bus reset isn't working properly, nothing
is going to work.

I sent these steps to the manufacturer to see if they could reproduce the
issue on their end. If they can then they will need to investigate on their
end why the problem exists. If they can't, it is possible we have a bad set
of boards in this machine.

Thank you so much for your help. Really appreciate it.

-Kevin

#: lspci -tv

 |           +-1f.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 |           \-1f.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 VCU
 \-[0000:00]-+-00.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 DMI2
             +-01.0-[01]--
             +-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               +-04.0-[05]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               +-08.0-[06]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               +-0c.0-[07]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               \-14.0-[08]----00.0  Mellanox Technologies MT27500 Family [ConnectX-3]
             +-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+-00.0-[0d]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               |                               +-04.0-[0e]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               |                               +-08.0-[0f]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               |                               +-0c.0-[10]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
             |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
             |                               |                               \-14.0-[11]--+-00.0  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.1  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.2  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.3  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.4  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.5  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            +-00.6  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               |                                            \-00.7  Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
             |                               \-10.0-[12]--
             +-05.0  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Address Map, VTd_Misc, System Management
             +-05.1  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Hot Plug
             +-05.2  Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 RAS, Control Status and Global Errors


# showing which ones are in failed state
#: lspci -vnnn | grep NVIDIA

04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
04:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev ff) (prog-if ff)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
05:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev ff) (prog-if ff)
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:1132]
06:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1132]
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:1132]
07:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1132]
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:1132]
0d:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1132]
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:1132]
0e:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1132]
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:1132]
0f:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1132]
10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:1132]
10:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:0fb0] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1132]
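
For anyone who wants to pick out the failed functions without reading the
whole listing: a device that has dropped off the bus reads back all ones
from config space, which lspci prints as (rev ff). A minimal sketch of a
filter for that, assuming plain lspci output on stdin (the function name
failed_functions is made up for illustration):

```shell
# failed_functions: read `lspci` output on stdin and print the address of
# every function whose revision reads back as ff (all-ones config space).
# The name is hypothetical; only standard awk is used.
failed_functions() {
    awk '/\(rev ff\)/ && $1 ~ /^([0-9a-f]+:)?[0-9a-f]+:[0-9a-f]+\.[0-7]$/ { print $1 }'
}
```

Something like "lspci -d 10de: | failed_functions" should then list only
the NVIDIA functions that are currently in the failed state.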


#showing parent bridge of a device that has failed
#:lspci -vvvs 03:00
03:00.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00
[Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Bus: primary=03, secondary=04, subordinate=04, sec-latency=0
I/O behind bridge: 00009000-00009fff
Memory behind bridge: c5000000-c60fffff
Prefetchable memory behind bridge: 000038ffe0000000-000038fff1ffffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+
Address: 00000000fee003b8  Data: 0000
Masking: 000000ff  Pending: 00000000
Capabilities: [68] Express (v2) Downstream Port (Slot+), MSI 00
DevCap: MaxPayload 2048 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency
L0s <4us, L1 <8us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt-
ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #32, PowerLimit 75.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet+ LinkState-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Via
message ARIFwd+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
ARIFwd-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, Selectable
De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance-
ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+,
EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [a4] Subsystem: PLX Technology, Inc. Device 3577
Capabilities: [100 v1] Device Serial Number ab-87-00-10-b5-df-0e-00
Capabilities: [fb4 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 1f, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [138 v1] Power Budgeting <?>
Capabilities: [10c v1] #19
Capabilities: [148 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=8
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=03 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64+ WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=WRR64 TC/VC=01
Status: NegoPending+ InProgress-
Port Arbitration Table <?>
Capabilities: [e00 v1] #12
Capabilities: [f24 v1] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+
DirectTrans+
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl-
DirectTrans-
Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0 Len=010
<?>
Kernel driver in use: pcieport

#showing secondary device of 03:00 (parent) which is in failed state
#: lspci -vvvs 04:00.0
04:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
TITAN X] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: pci-stub

#showing the .1 function (audio adapter) behind 03:00 (parent), which is in failed state
#: lspci -vvvs 04:00.1
04:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: pci-stub
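
In case it helps anyone map a failed function back to its parent bridge
without reading the tree by hand, here is a small sketch that resolves the
sysfs symlink for a device and takes the directory above it. It assumes a
Linux sysfs mounted at /sys and a full domain:bus:dev.fn address; the
function name parent_bridge and its optional second argument are made up
for illustration.

```shell
# parent_bridge BDF [SYSFS_ROOT]
# Print the PCI address of the bridge directly above BDF, e.g. 0000:04:00.0.
# SYSFS_ROOT defaults to /sys and is only a parameter so the lookup can be
# tried against a fake directory tree.
parent_bridge() {
    local bdf="$1" root="${2:-/sys}" path
    path=$(readlink -f "$root/bus/pci/devices/$bdf") || return 1
    basename "$(dirname "$path")"
}
```

On the topology above, "parent_bridge 0000:04:00.0" should print
0000:03:00.0. For a device sitting directly on the root bus the result is
the host bridge directory (pci0000:00) rather than a device address.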


#showing parent device that has a NON failed device
#: lspci -vvvs 03:08
03:08.0 PCI bridge: PLX Technology, Inc. Device 8796 (rev ab) (prog-if 00
[Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Bus: primary=03, secondary=06, subordinate=06, sec-latency=0
I/O behind bridge: 00007000-00007fff
Memory behind bridge: c1000000-c20fffff
Prefetchable memory behind bridge: 000038ffa0000000-000038ffb1ffffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+
Address: 00000000fee003f8  Data: 0000
Masking: 000000ff  Pending: 00000000
Capabilities: [68] Express (v2) Downstream Port (Slot+), MSI 00
DevCap: MaxPayload 2048 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 128 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency
L0s <4us, L1 <8us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt-
ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #32, PowerLimit 75.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
Changed: MRL- PresDet- LinkState-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Via
message ARIFwd+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
ARIFwd-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, Selectable
De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance-
ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+,
EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [a4] Subsystem: PLX Technology, Inc. Device 3577
Capabilities: [100 v1] Device Serial Number ab-87-00-10-b5-df-0e-00
Capabilities: [fb4 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 1f, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [138 v1] Power Budgeting <?>
Capabilities: [10c v1] #19
Capabilities: [148 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=8
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=03 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64+ WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=WRR64 TC/VC=01
Status: NegoPending- InProgress-
Port Arbitration Table <?>
Capabilities: [e00 v1] #12
Capabilities: [f24 v1] Access Control Services
ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+
DirectTrans+
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl-
DirectTrans-
Capabilities: [b70 v1] Vendor Specific Information: ID=0001 Rev=0 Len=010
<?>
Kernel driver in use: pcieport
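
The most obvious difference between the two bridge dumps is the LnkSta
line: the port above the failed GPU (03:00.0) reports Speed 2.5GT/s, Width
x0, while the port above the working GPU (03:08.0) reports Speed 8GT/s,
Width x16. A minimal sketch for pulling just that field out of "lspci -vv"
output, assuming the output is piped in on stdin (the function name
link_status is made up for illustration):

```shell
# link_status: read `lspci -vv` output on stdin and print "<speed> <width>"
# from the LnkSta line. LnkSta2 is not matched because of the colon.
# Newer lspci versions may append "(ok)" or "(downgraded)" after each
# value; those suffixes are dropped.
link_status() {
    sed -n 's/.*LnkSta:[[:space:]]*Speed \([^ ,]*\)[^,]*, Width \(x[0-9]*\).*/\1 \2/p'
}
```

Running "sudo lspci -vvs 03:00.0 | link_status" against each downstream
port should then show at a glance which links did not come back.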

#showing secondary device of 03:08 which is in NON failed state
#: lspci -vvvs 06:00.0
06:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
TITAN X] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 1132
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 5
Region 0: Memory at c1000000 (32-bit, non-prefetchable) [disabled]
[size=16M]
Region 1: Memory at 38ffa0000000 (64-bit, prefetchable) [disabled]
[size=256M]
Region 3: Memory at 38ffb0000000 (64-bit, prefetchable) [disabled]
[size=32M]
Region 5: I/O ports at 7000 [disabled] [size=128]
Expansion ROM at c2000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000  Data: 0000
Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency
L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt-
ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance-
ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+,
EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [250 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [258 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
 PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024
<?>
Capabilities: [900 v1] #19
Kernel driver in use: pci-stub


#showing the .1 function (audio adapter) behind 03:08, which is in NON failed state
#: lspci -vvvs 06:00.1
06:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
Subsystem: NVIDIA Corporation Device 1132
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin B routed to IRQ 3
Region 0: Memory at c2080000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000  Data: 0000
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #8, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency
L0s <1us, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt-
ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-,
EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Kernel driver in use: pci-stub



On Tue, Oct 18, 2016 at 6:03 PM, Alex Williamson <alex.williamson at redhat.com> wrote:

> On Tue, 18 Oct 2016 17:48:59 -0500
> Kevin Vasko <kvasko at gmail.com> wrote:
>
> > Alex,
> >
> > I think I was able to run it and successfully make the thing fail.
> > It went from (rev a1) to (rev ff), and lspci responded with the
> > header error.
> >
> > Instead of doing all devices I just did 1 at a time.
> >
> > this was the output of
> >
> > # lspci -tv
> >
> > +-02.0-[02-08]----00.0-[03-08]--+-00.0-[04]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> > |                               +-04.0-[05]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> > |                               +-08.0-[06]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> > |                               +-0c.0-[07]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> > |                               \-14.0-[08]----00.0  Mellanox Technologies MT27500 Family [ConnectX-3]
> > +-03.0-[09-12]----00.0-[0a-12]--+-08.0-[0b-11]----00.0-[0c-11]--+-00.0-[0d]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> > |                               |                               +-04.0-[0e]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> > |                               |                               +-08.0-[0f]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> > |                               |                               +-0c.0-[10]--+-00.0  NVIDIA Corporation GM200 [GeForce GTX TITAN X]
> > |                               |                               |            \-00.1  NVIDIA Corporation Device 0fb0
> >
> > I tried the first device
> > # virsh nodedev-detach --driver=kvm pci_0000_04_00_0
> > Device pci_0000_04_00_0 detached
> >
> > # virsh nodedev-detach --driver=kvm pci_0000_04_00_1
> > Device pci_0000_04_00_1 detached
> >
> > In the script I put
> >
> > DEVS=(
> >             03:00.0
> >             04
> > )
> >
> > Ran it 100 times and got no error.
> >
> > Ran it for a different device 05
> >
> >
> >
> > # virsh nodedev-detach --driver=kvm pci_0000_05_00_0
> > Device pci_0000_05_00_0 detached
> >
> > # virsh nodedev-detach --driver=kvm pci_0000_05_00_1
> > Device pci_0000_05_00_1 detached
> >
> > DEVS=(
> >             03:04.0
> >             05:
> > )
> >
> >
> > I saw this.
> >
> > #: for i in $(seq 1 100); do ./reset.sh; done
> > 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
> > TITAN X] (rev a1)
> > 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> > 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
> > TITAN X] (rev a1)
> > 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)
> > 05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX
> > TITAN X] (rev ff)
> > 05:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev ff)
> >
> > I repeated this with another device on the system.
> >
> > I assume this indicates that the device is not resetting properly? The
> > question is where do I go from here? Would this indicate a problem with
> > the PCI reset code or problematic hardware?
>
> Right, the PCIe link is not coming back for some reason, that seems
> like a hardware issue.  Can you attach the output of 'sudo lspci -vvvs
> 3:04.0' when you're in this state (replace with the appropriate parent
> bridge depending on the failed device), maybe we can see if that
> downstream port is stuck in training.
>
> What I would do next is to test each card repeatedly.  Do only some
> cards fail?  If so, swap a working card and a non-working card, does
> the failure follow the card or the slot?  I'm not sure what the result
> is going to be, but if we can't rely on a PCI bus reset then you're
> really not going to have any repeatability with assigning the GPUs.
> Thanks,
>
> Alex
>
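
A rough sketch of the "test each card repeatedly" loop suggested above,
written so that it stops at the first reset after which the device reads
back as (rev ff). It assumes the reset script from earlier in this thread
is available as ./reset.sh (its contents are not reproduced here); the
names first_failure, RESET_CMD and LSPCI are made up for illustration.

```shell
# RESET_CMD and LSPCI can be overridden, so the loop can be dry-run
# without touching real hardware.
RESET_CMD=${RESET_CMD:-./reset.sh}   # assumed: the reset script from this thread
LSPCI=${LSPCI:-lspci}

# first_failure DEV [MAX]
# Run the reset up to MAX times (default 100). After each reset, check
# whether DEV shows "(rev ff)". Prints the iteration number of the first
# failure and returns 1; returns 0 if the device survived every reset,
# or 2 if the reset command itself failed.
first_failure() {
    local dev="$1" max="${2:-100}" i=1
    while [ "$i" -le "$max" ]; do
        "$RESET_CMD" >/dev/null || return 2
        if "$LSPCI" -s "$dev" | grep -q '(rev ff)'; then
            echo "$i"
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

For example, "first_failure 05:00.0 100" would print the number of the
reset that knocked 05:00.0 off the bus, or nothing if all 100 resets went
through. Recording that number per card and per slot is one way to see
whether the failure follows the card or the slot.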