[vfio-users] Welcome to the "vfio-users" mailing list (Digest mode)
Kevin Vasko
kvasko at gmail.com
Mon Oct 17 21:26:15 UTC 2016
Issues with pcipassthrough reliably working with Titan X GPUs
We have a machine with 8 Titan X GPUs in it (Cirrascale GX8). We are
trying to use KVM (OpenStack does the provisioning) and PCI passthrough
to launch VM instances on this system so that multiple users can use the
GPUs, but we are having some issues doing so. I would like some help or
tips, if possible, on how to debug this issue.
The problem is that it seems that under certain circumstances the
attachment of the GPU to the VM will fail (seemingly randomly). I will see
"unknown header type 7f, ignoring device" in the VM with the id of the
device it tried to attach.
We are using Ubuntu with kernel 3.19.0-71-generic.
I blacklisted the cards so the host UI doesn't attach.
sudo gedit /etc/modules and add:
pci_stub
vfio
vfio_iommu_type1
vfio_pci
kvm
kvm_intel
sudo update-grub
sudo vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1"
sudo update-grub
sudo gedit /etc/initramfs-tools/modules
pci_stub ids=10de:17c2,10de:1132,10de:0fb0
sudo update-initramfs -u
reboot
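After the reboot it may be worth confirming that the kernel parameters
from the GRUB step actually took effect. The following is a minimal
sketch, assuming a POSIX shell; check_cmdline is a hypothetical helper
name, not part of any existing tool, and it only checks the two
parameters listed above.

```shell
# Report whether each expected parameter appears in a kernel command line.
# $1: the command line text, e.g. the contents of /proc/cmdline.
# Returns 0 if every parameter is present, 1 otherwise.
check_cmdline() {
    cmdline=$1
    missing=0
    for param in intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1; do
        # Pad with spaces so only whole, space-separated words match.
        case " $cmdline " in
            *" $param "*) echo "ok: $param" ;;
            *) echo "missing: $param"; missing=1 ;;
        esac
    done
    return "$missing"
}
```

On the host this could be called as: check_cmdline "$(cat /proc/cmdline)"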
After the reboot I run the command
lspci -nnk -d 10de:17c2
and get:
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Kernel driver in use: pci-stub
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Kernel driver in use: pci-stub
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Kernel driver in use: pci-stub
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Kernel driver in use: pci-stub
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Kernel driver in use: pci-stub
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Kernel driver in use: pci-stub
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Kernel driver in use: pci-stub
10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Kernel driver in use: pci-stub
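The same driver-binding check can be scripted against sysfs instead of
reading the lspci output by eye, which makes it easier to repeat after
each VM launch. This is a rough sketch, assuming a POSIX shell and the
standard /sys/bus/pci/devices layout; list_bound_drivers is a
hypothetical helper name, and the sysfs root is a parameter only so the
function can be exercised against a test directory.

```shell
# Print "<address> <driver>" for every PCI device with the given IDs.
# $1: vendor ID as sysfs shows it (e.g. 0x10de)
# $2: device ID as sysfs shows it (e.g. 0x17c2)
# $3: sysfs root, defaults to /sys
list_bound_drivers() {
    vendor=$1
    device=$2
    root=${3:-/sys}
    for dev in "$root"/bus/pci/devices/*; do
        [ -r "$dev/vendor" ] || continue
        [ "$(cat "$dev/vendor")" = "$vendor" ] || continue
        [ "$(cat "$dev/device")" = "$device" ] || continue
        # The "driver" entry is a symlink to the bound driver, if any.
        if [ -e "$dev/driver" ] || [ -L "$dev/driver" ]; then
            driver=$(basename "$(readlink "$dev/driver")")
        else
            driver="(none)"
        fi
        echo "$(basename "$dev") $driver"
    done
}
```

For the cards in this machine the call would be:
list_bound_drivers 0x10de 0x17c2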
When I go through the process of creating a VM (shown below), I get
random results as to whether a device attaches properly to the VM
(sometimes it does, sometimes it doesn't).
Example of a failed attachment
On the host:
#: dmesg
Netfilter messages via NETLINK v0.30
ip_set: protocol 6
vfio-pci 0000:10:00.0: enabling device (0100 -> 0103)
vfio_ecap_init: 0000:10:00.0 hiding ecap 0x1e at 0x258
vfio_ecap_init: 0000:10:00.0 hiding ecap 0x19 at 0x900
kvm:zapping shadow pages for mmio generation wraparound
kvm [11446]: vcpu0 unhandled rdmsr: 0x606
kvm [11446]: vcpu0 unhandled rdmsr: 0x611
kvm [11446]: vcpu0 unhandled rdmsr: 0x639
kvm [11446]: vcpu0 unhandled rdmsr: 0x641
kvm [11446]: vcpu0 unhandled rdmsr: 0x619
lspci -vnnn -d 10de:17c2 (this output is really long, so I'm only including
the 10:00.0 device):
10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: vfio-pci
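For context on the output above: lspci masks the header type byte with
0x7f (the top bit is the multi-function flag), so a device whose config
space reads back as all ones (0xff) is reported as header type 7f, with
rev ff and prog-if ff for the same reason. That pattern commonly means
the device is not responding to config reads at all. A minimal sketch
for spotting such devices from sysfs follows, assuming a POSIX shell
with od and tr available; both function names are hypothetical.

```shell
# Return success (0) if the first four bytes of a PCI config-space file
# (vendor ID and device ID) all read back as 0xff.
# Note: 0xff & 0x7f = 0x7f, which is why an all-ones read shows up as
# "Unknown header type 7f".
config_reads_all_ones() {
    bytes=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$bytes" = "ffffffff" ]
}

# Print the address of every PCI device whose config space reads all ones.
# $1: sysfs root, defaults to /sys (a parameter only for testing).
scan_dead_devices() {
    root=${1:-/sys}
    for dev in "$root"/bus/pci/devices/*; do
        [ -r "$dev/config" ] || continue
        if config_reads_all_ones "$dev/config"; then
            echo "$(basename "$dev")"
        fi
    done
}
```

Running scan_dead_devices on the host before and after each VM launch
would show at which point a card drops into this state.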
The instance's kvm command (near the end you will see -device
vfio-pci,host=10:00.0, where it passes the GPU to the VM):
/usr/bin/kvm -name instance-00000182 -S -machine
pc-i440fx-vivid,accel=kvm,usb=off -cpu
Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 16384 -realtime mlock=off -smp 6,sockets=6,cores=1,threads=1 -uuid
dc37c94f-d6d2-42ac-8fff-1c3a6604f317 -smbios type=1,manufacturer=OpenStack
Foundation,product=OpenStack
Nova,version=13.0.0,serial=8e34e073-7b4c-4e69-84fa-2d044032ad30,uuid=dc37c94f-d6d2-42ac-8fff-1c3a6604f317,family=Virtual
Machine -no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000182.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet
-no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/var/lib/nova/instances/dc37c94f-d6d2-42ac-8fff-1c3a6604f317/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=writethrough
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:cf:9c:1d,bus=pci.0,addr=0x3
-chardev
file,id=charserial0,path=/var/lib/nova/instances/dc37c94f-d6d2-42ac-8fff-1c3a6604f317/console.log
-device isa-serial,chardev=charserial0,id=serial0 -chardev
pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
-device usb-tablet,id=input0 -vnc 0.0.0.0:1 -k en-us -device
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
vfio-pci,host=10:00.0,id=hostdev0,bus=pci.0,addr=0x5
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on
On the CentOS 7 VM, lspci -vnnn does not show the device.
In dmesg on the same VM I see this:
[ 0.751028] pci 0000:00:05.0: [10de:17c2] type 7f class 0xffffff
[ 0.751041] pci 0000:00:05.0: unknown header type 7f, ignoring device
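To check a guest for this failure without reading the whole log, the
kernel messages can be filtered for the line shown above. This is only
a sketch, assuming a POSIX shell with sed and the message wording seen
in this CentOS 7 guest; ignored_pci_devices is a hypothetical helper
name.

```shell
# Read kernel log text on stdin and print "<address> <header type>" for
# every PCI device the kernel ignored because of an unknown header type.
ignored_pci_devices() {
    sed -n 's/.*pci \([0-9a-f:.]*\): unknown header type \([0-9a-f]*\), ignoring device.*/\1 \2/p'
}
```

In the guest this could be used as: dmesg | ignored_pci_devices
An empty result means no device was rejected for this reason.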
At this point, without doing anything different (no reboot), I start up
another VM (device 0f:00.0). Nothing is different other than the system
using a different device ID, and this time the VM starts up successfully
and the GPU attaches properly.
On the Host
lspci -vnnn -d 10de:17c2
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
subsystem: NVIDIA Corporation Device [10de:1132]
Flags: fast devsel, IRQ 28
Memory at b9000000 (32-bit, non-prefetchable) [size=16M]
Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]
Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]
I/O ports at 3000 [size=128]
Expansion ROM at ba000000 [disabled] [size=512k]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Express Legacy Endpoint, MSI 00
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [420] Advanced Error Reporting
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
<?>
Capabilities: [900] #19
Kernel driver in use: vfio-pci
10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: vfio-pci
KVM command (as you can see host=0f:00.0 is the GPU on the host device)
/usr/bin/kvm -name instance-00000183 -S -machine
pc-i440fx-vivid,accel=kvm,usb=off -cpu
Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 16000 -realtime mlock=off -smp 6,sockets=6,cores=1,threads=1 -uuid
3c844181-2ae2-46d6-83ad-22363ad26e35 -smbios type=1,manufacturer=OpenStack
Foundation,product=OpenStack
Nova,version=13.1.1,serial=fa62c66d-7e84-45a4-addd-bf293c06c348,uuid=3c844181-2ae2-46d6-83ad-22363ad26e35,family=Virtual
Machine -no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000183.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet
-no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/var/lib/nova/instances/3c844181-2ae2-46d6-83ad-22363ad26e35/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:35:57:04,bus=pci.0,addr=0x3
-chardev
file,id=charserial0,path=/var/lib/nova/instances/3c844181-2ae2-46d6-83ad-22363ad26e35/console.log
-device isa-serial,chardev=charserial0,id=serial0 -chardev
pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
-device usb-tablet,id=input0 -vnc 0.0.0.0:1 -k en-us -device
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on
On the guest VM
lspci -vnnn -d 10de:17c2
00:05.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device [10de:1132]
Physical Slot: 5
Flags: fast devsel, IRQ 11
Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at f2000000 (64-bit, prefetchable) [size=32M]
I/O ports at c000 [size=128]
Expansion ROM at fe000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
dmesg
[ 0.839593] pci 0000:00:05.0: [10de:17c2] type 00 class 0x030000
This shows that PCI passthrough is working properly for this device. The
only way to reset the failed device from the (rev ff) (prog-if ff) state
is to reboot the host box.
What I mean by random is that sometimes the failure happens the first
time I attach a GPU (as in this case), and other times it happens on a
different attachment. For example, I have tried attaching 2 of the
devices to one VM: one GPU attaches properly, and the other does not and
goes into the Unknown header type 7f state. I have also tried attaching
4 GPUs: 3 GPUs work and the 4th fails. Different devices fail and succeed
(e.g. device 10:00.0 failed this time and 0f:00.0 succeeded, whereas if I
attach 2 GPUs to a VM it will be reversed, 0f:00.0 will fail and 10:00.0
will succeed), so I don't feel it is hardware related.
Any suggestions on debugging a !!! Unknown header type 7f?
Thanks,
-Kevin