[vfio-users] Welcome to the "vfio-users" mailing list (Digest mode)

Kevin Vasko kvasko at gmail.com
Mon Oct 17 21:26:15 UTC 2016


Issues with pcipassthrough reliably working with Titan X GPUs

We have a machine with 8 Titan X GPUs in it (Cirrascale GX8). We are
trying to use KVM (OpenStack is doing the provisioning) and PCI passthrough
to launch VM instances on this system so multiple users can utilize the GPUs,
but we are having some issues doing so. I would like some help/tips, if
possible, on how to debug this.

The problem is that attaching a GPU to a VM sometimes fails, seemingly at
random. When it does, I see "unknown header type 7f, ignoring device" in
the VM, with the ID of the device it tried to attach.

We are using Ubuntu with kernel 3.19.0-71-generic.

I reserved the cards with pci-stub so the host UI doesn't attach to them.

sudo gedit /etc/modules and add:

pci_stub
vfio
vfio_iommu_type1
vfio_pci
kvm
kvm_intel

sudo update-grub

sudo vi /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1"

sudo update-grub


sudo gedit /etc/initramfs-tools/modules

pci_stub ids=10de:17c2,10de:1132,10de:0fb0

sudo update-initramfs -u
reboot
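
After the reboot, the settings above can be sanity-checked before any VM is started. This is a minimal sketch, not part of my original setup: it checks that intel_iommu=on reached the running kernel and lists which IOMMU group each device landed in. SYSFS_ROOT and CMDLINE_FILE are hypothetical override variables I made up so the functions can be pointed at a test tree; on a real host they default to /sys and /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical overrides (not standard variables); defaults are the real paths.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
CMDLINE_FILE="${CMDLINE_FILE:-/proc/cmdline}"

# Succeeds only if intel_iommu=on is present as a whole kernel parameter.
check_cmdline() {
    tr ' ' '\n' < "$CMDLINE_FILE" | grep -qx 'intel_iommu=on'
}

# Prints "group <n>: <pci address>" for every device in every IOMMU group.
list_groups() {
    for dev in "$SYSFS_ROOT"/kernel/iommu_groups/*/devices/*; do
        [ -e "$dev" ] || continue          # glob matched nothing
        group=${dev%/devices/*}            # strip "/devices/<address>"
        group=${group##*/}                 # keep only the group number
        echo "group $group: ${dev##*/}"
    done
}
```

On the host, check_cmdline should succeed and list_groups should show each of the 8 GPUs. If list_groups prints nothing, the IOMMU is probably not enabled.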

Running lspci -nnk -d 10de:17c2 shows:

04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

            Kernel driver in use: pci-stub

05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

            Kernel driver in use: pci-stub

06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

            Kernel driver in use: pci-stub

07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

            Kernel driver in use: pci-stub

0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

            Kernel driver in use: pci-stub

0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

            Kernel driver in use: pci-stub

0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

            Kernel driver in use: pci-stub

10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

            Kernel driver in use: pci-stub
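
With 8 cards it is easy to miss one that picked up the wrong driver. This is a rough sketch of how the check could be automated: it reads `lspci -nnk` output on stdin and reports the driver bound to each device. The function names are my own, and it assumes the default lspci address format without a domain prefix.

```shell
#!/bin/sh
# Reads `lspci -nnk` output on stdin and prints "<address> <driver>" per
# device; devices with no "Kernel driver in use" line are reported as "none".
drivers_in_use() {
    awk '
        /^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7] / {
            if (addr != "") print addr, drv   # flush the previous device
            addr = $1; drv = "none"
        }
        /Kernel driver in use:/ { drv = $NF }
        END { if (addr != "") print addr, drv }
    '
}

# Prints the addresses whose driver is neither pci-stub nor vfio-pci.
unexpected_drivers() {
    drivers_in_use | awk '$2 != "pci-stub" && $2 != "vfio-pci" { print $1 }'
}
```

For example, `lspci -nnk -d 10de:17c2 | unexpected_drivers` should print nothing when all cards are held by pci-stub or vfio-pci.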


When I create a VM (shown below), I get random results on whether a device
will attach properly to it (sometimes it does, sometimes it doesn't).


Example of a failed attachment


On the host:


#: dmesg


Netfilter messages via NETLINK v0.30

ip_set: protocol 6

vfio-pci 0000:10:00.0: enabling device (0100 -> 0103)

vfio_ecap_init: 0000:10:00.0 hiding ecap 0x1e at 0x258

vfio_ecap_init: 0000:10:00.0 hiding ecap 0x19 at 0x900

kvm:zapping shadow pages for mmio generation wraparound

kvm [11446]: vcpu0 unhandled rdmsr: 0x606

kvm [11446]: vcpu0 unhandled rdmsr: 0x611

kvm [11446]: vcpu0 unhandled rdmsr: 0x639

kvm [11446]: vcpu0 unhandled rdmsr: 0x641

kvm [11446]: vcpu0 unhandled rdmsr: 0x619



lspci -vnnn -d 10de:17c2 (this output is really long, so I'm only including
the 10:00.0 device):

10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)

            !!! Unknown header type 7f

            Kernel driver in use: vfio-pci
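
For reference, the header type is the byte at offset 0x0e of PCI config space, and lspci masks off the top (multi-function) bit, so a raw read of 0xff is reported as type 7f. Together with rev ff and prog-if ff, this suggests that every config read from the card is returning all ones. This is a minimal sketch for finding devices in that state from sysfs. The function names and the SYSFS_ROOT override are my own assumptions; on a real host it defaults to /sys.

```shell
#!/bin/sh
# Hypothetical override so the functions can be pointed at a test tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Prints the header-type byte (config offset 0x0e) of one device as two
# hex digits. $1 is a full address such as 0000:10:00.0.
header_type() {
    od -An -tx1 -j14 -N1 "$SYSFS_ROOT/bus/pci/devices/$1/config" | tr -d ' \n'
}

# Lists every PCI device whose header-type byte reads back as 0xff.
dead_devices() {
    for cfg in "$SYSFS_ROOT"/bus/pci/devices/*/config; do
        [ -r "$cfg" ] || continue
        dev=${cfg%/config}
        dev=${dev##*/}
        if [ "$(header_type "$dev")" = "ff" ]; then
            echo "$dev"
        fi
    done
}
```

Running dead_devices on the host before and after each VM launch would show at which step a card drops into this state.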


The instance's kvm command (near the end you will see -device
vfio-pci,host=10:00.0, where it passes the GPU to the VM):


/usr/bin/kvm -name instance-00000182 -S -machine
pc-i440fx-vivid,accel=kvm,usb=off -cpu
Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 16384 -realtime mlock=off -smp 6,sockets=6,cores=1,threads=1 -uuid
dc37c94f-d6d2-42ac-8fff-1c3a6604f317 -smbios type=1,manufacturer=OpenStack
Foundation,product=OpenStack
Nova,version=13.0.0,serial=8e34e073-7b4c-4e69-84fa-2d044032ad30,uuid=dc37c94f-d6d2-42ac-8fff-1c3a6604f317,family=Virtual
Machine -no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000182.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet
-no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/var/lib/nova/instances/dc37c94f-d6d2-42ac-8fff-1c3a6604f317/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=writethrough
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:cf:9c:1d,bus=pci.0,addr=0x3
-chardev
file,id=charserial0,path=/var/lib/nova/instances/dc37c94f-d6d2-42ac-8fff-1c3a6604f317/console.log
-device isa-serial,chardev=charserial0,id=serial0 -chardev
pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
-device usb-tablet,id=input0 -vnc 0.0.0.0:1 -k en-us -device
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
vfio-pci,host=10:00.0,id=hostdev0,bus=pci.0,addr=0x5
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on



On the CentOS 7 VM, lspci -vnnn does not show the device.


In the CentOS 7 VM's dmesg I see this:


[    0.751028] pci 0000:00:05.0: [10de:17c2] type 7f class 0xffffff

[    0.751041] pci 0000:00:05.0: unknown header type 7f, ignoring device
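
To check a guest for this failure without reading the whole log, something like the following sketch could be used. It assumes the kernel message format shown above, and the function name is my own.

```shell
#!/bin/sh
# Reads kernel log text on stdin and prints the PCI address of every
# device the kernel skipped because of an unknown header type.
ignored_pci_devices() {
    sed -n 's/.*pci \([0-9a-f:.]*\): unknown header type [0-9a-f]*, ignoring device.*/\1/p'
}
```

Inside the VM, `dmesg | ignored_pci_devices` would print 0000:00:05.0 for the failed attachment above and nothing for a working one.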




At this point, without doing anything different (no reboot), I start up
another VM (device 0f:00.0). Nothing is different other than the system
using a different device ID, and it starts up successfully and the GPU
attaches properly.


On the Host


lspci -vnnn -d 10de:17c2


0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

     subsystem: NVIDIA Corporation Device [10de:1132]

     Flags: fast devsel, IRQ 28

     Memory at b9000000 (32-bit, non-prefetchable) [size=16M]

     Memory at 38ff20000000 (64-bit, prefetchable) [size=256M]

     Memory at 38ff30000000 (64-bit, prefetchable) [size=32M]

     I/O ports at 3000 [size=128]

     Expansion ROM at ba000000 [disabled] [size=512k]

     Capabilities: [60] Power Management version 3

     Capabilities: [68] MSI: Enable-1 Count=1/1 Maskable- 64bit+

     Capabilities: [78] Express Legacy Endpoint, MSI 00

     Capabilities: [100] Express Legacy Endpoint, MSI 00

     Capabilities: [250] Latency Tolerance Reporting

     Capabilities: [258] L1 PM Substates

     Capabilities: [128] Power Budgeting <?>

     Capabilities: [420] Advanced Error Reporting

     Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
<?>

     Capabilities: [900] #19

     Kernel driver in use: vfio-pci


10:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev ff) (prog-if ff)

            !!! Unknown header type 7f

            Kernel driver in use: vfio-pci


KVM command (as you can see, host=0f:00.0 is the GPU's address on the host):


/usr/bin/kvm -name instance-00000183 -S -machine
pc-i440fx-vivid,accel=kvm,usb=off -cpu
Haswell-noTSX,+abm,+pdpe1gb,+rdrand,+f16c,+osxsave,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 16000 -realtime mlock=off -smp 6,sockets=6,cores=1,threads=1 -uuid
3c844181-2ae2-46d6-83ad-22363ad26e35 -smbios type=1,manufacturer=OpenStack
Foundation,product=OpenStack
Nova,version=13.1.1,serial=fa62c66d-7e84-45a4-addd-bf293c06c348,uuid=3c844181-2ae2-46d6-83ad-22363ad26e35,family=Virtual
Machine -no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000183.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet
-no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/var/lib/nova/instances/3c844181-2ae2-46d6-83ad-22363ad26e35/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:35:57:04,bus=pci.0,addr=0x3
-chardev
file,id=charserial0,path=/var/lib/nova/instances/3c844181-2ae2-46d6-83ad-22363ad26e35/console.log
-device isa-serial,chardev=charserial0,id=serial0 -chardev
pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
-device usb-tablet,id=input0 -vnc 0.0.0.0:1 -k en-us -device
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
vfio-pci,host=0f:00.0,id=hostdev0,bus=pci.0,addr=0x5 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on


On the guest VM


lspci -vnnn -d 10de:17c2


00:05.0 VGA compatible controller [0300]: NVIDIA Corporation GM200 [GeForce
GTX TITAN X] [10de:17c2] (rev a1) (prog-if 00 [VGA controller])

        Subsystem: NVIDIA Corporation Device [10de:1132]

        Physical Slot: 5

        Flags: fast devsel, IRQ 11

        Memory at fd000000 (32-bit, non-prefetchable) [size=16M]

        Memory at e0000000 (64-bit, prefetchable) [size=256M]

        Memory at f2000000 (64-bit, prefetchable) [size=32M]

        I/O ports at c000 [size=128]

        Expansion ROM at fe000000 [disabled] [size=512K]

        Capabilities: [60] Power Management version 3

        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+

        Capabilities: [78] Express Legacy Endpoint, MSI 00



dmesg

[    0.839593] pci 0000:00:05.0: [10de:17c2] type 00 class 0x030000



This shows that the PCI passthrough is working properly. The only way to
reset the failed device from the (rev ff) (prog-if ff) state is to reboot
the host box.

What I mean by random is that sometimes the failure happens the first time
I attach a GPU (as in this case); other times it is a different one.

For example, I have tried attaching 2 of the devices to one VM: one GPU
will attach properly, and the other will not and goes into the Unknown
header type 7f state. I have also tried to attach 4 GPUs: 3 GPUs will work
and the 4th will fail. Different devices will fail and succeed (e.g. device
10:00.0 failed this time and 0f:00.0 succeeded, whereas if I attach 2x GPUs
to a VM, it will be reversed: 0f:00.0 will fail and 10:00.0 will succeed),
so I don't feel it is hardware related.

Any suggestions on debugging a !!! Unknown header type 7f?

Thanks,

-Kevin