[vfio-users] NVIDIA K2 passthrough sometimes fails
Martijn Kint
martijn.kint at surfsara.nl
Fri Feb 26 11:44:05 UTC 2016
Hi,
I'm seeing an intermittent problem in an Ubuntu (14.04.4) VM. It only
happens when we attach 2 GRID K2 PCI devices. When it occurs, the first
card has the nvidia driver loaded but the second NVIDIA device has no
driver attached to it. After we (hard) reboot the VM, it might come up
fine the next time with the nvidia module bound to both K2 devices, or
it might fail in exactly the same way; it seems to be quite random.
We are using the cards mostly for GPU calculations and 3D visualization
of scientific data, so we're not building a virtual Windows game PC :)
So the question is, what could cause this error? As it does not happen
every time I guess it must have something to do with the order of the
modules being loaded, but that's just a guess.
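To make the symptom described above easy to spot inside the guest, a small check can list every NVIDIA PCI function together with the driver bound to it. This is only a minimal sketch: the function name nvidia_driver_status and the SYSFS_PCI override are hypothetical (they are not part of the setup described here), and it assumes the standard sysfs layout under /sys/bus/pci/devices and NVIDIA's PCI vendor ID 0x10de.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original setup.
# SYSFS_PCI may be overridden for testing; it defaults to the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Print "<pci-address> <driver>" for every PCI function whose vendor ID
# is 0x10de (NVIDIA). Functions without a bound driver are reported as
# "(no driver)".
nvidia_driver_status() {
    for path in "$SYSFS_PCI"/*; do
        # Skip entries without a readable vendor file (or an unmatched glob).
        [ -r "$path/vendor" ] || continue
        [ "$(cat "$path/vendor")" = "0x10de" ] || continue
        if [ -L "$path/driver" ]; then
            # The driver entry is a symlink; its last component is the name.
            drv=$(basename "$(readlink "$path/driver")")
        else
            drv="(no driver)"
        fi
        echo "$(basename "$path") $drv"
    done
}
```

Calling nvidia_driver_status in the guest after each boot would show whether the second K2 device came up without the nvidia driver on that particular boot.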
Hardware:
Fujitsu PRIMERGY CX2570 M1
2 x NVIDIA GRID K2 (4 PCI devices)
2 x Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz (HT enabled)
256GB DDR4 ECC RAM
Intel C610/X99 series chipset
HOST OS: Fedora22
Kernel: 4.3.4-200.fc22.x86_64
Qemu/KVM: qemu-kvm-2.4.1-1 (virt-preview repo)
GRUB:
GRUB_CMDLINE_LINUX="nomodeset selinux=disabled elevator=deadline
rd.driver.pre=vfio-pci rd.driver.blacklist=nouveau intel_iommu=on"
/etc/libvirt/qemu.conf
user = "qemu"
group = "qemu"
clear_emulator_capabilities = 0
dynamic_ownership = 0
cgroup_controllers = [ "cpu", "cpuacct", "cpuset" ]
max_files = 100000
cgroup_device_acl = [
"/dev/null", "/dev/full", "/dev/zero",
"/dev/random", "/dev/urandom",
"/dev/ptmx", "/dev/kvm", "/dev/kqemu",
"/dev/rtc","/dev/hpet", "/dev/vfio/vfio",
"/dev/vfio/45", "/dev/vfio/46", "/dev/vfio/58",
"/dev/vfio/59"
]
/etc/udev/rules.d/10-qemu-hw-users.rules
KERNEL=="45", SUBSYSTEM=="vfio", OWNER="qemu", GROUP="qemu", MODE="0660"
KERNEL=="46", SUBSYSTEM=="vfio", OWNER="qemu", GROUP="qemu", MODE="0660"
KERNEL=="58", SUBSYSTEM=="vfio", OWNER="qemu", GROUP="qemu", MODE="0660"
KERNEL=="59", SUBSYSTEM=="vfio", OWNER="qemu", GROUP="qemu", MODE="0660"
KERNEL=="vfio", SUBSYSTEM=="misc", OWNER="qemu", GROUP="qemu", MODE="0660"
/etc/modprobe.d/blacklist.conf:
# disable for grid K2
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
/usr/local/bin/vfio-bind:
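#!/bin/sh
# Bind each PCI device given on the command line to vfio-pci.
modprobe vfio-pci
for dev in "$@"; do
    vendor=$(cat "/sys/bus/pci/devices/$dev/vendor")
    device=$(cat "/sys/bus/pci/devices/$dev/device")
    # Detach the device from whatever driver currently owns it.
    if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
        echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
    fi
    # Register the vendor/device ID so that vfio-pci claims the device.
    echo "$vendor $device" > /sys/bus/pci/drivers/vfio-pci/new_id
done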
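The script above registers a vendor/device ID through new_id, so vfio-pci will claim any unbound device with that ID, not only the one named on the command line. A per-device variant is sketched below for comparison. It assumes the driver_override sysfs attribute (present in kernels 3.16 and later) and the drivers_probe file of the PCI bus; the function name bind_one_to_vfio and the SYSFS_PCI_BUS override are hypothetical and only there to keep the sketch testable.

```shell
#!/bin/sh
# Hypothetical per-device variant, not the script used in this setup.
# SYSFS_PCI_BUS may be overridden for testing; it defaults to the real path.
SYSFS_PCI_BUS="${SYSFS_PCI_BUS:-/sys/bus/pci}"

# Bind exactly one PCI function (e.g. 0000:84:00.0) to vfio-pci.
bind_one_to_vfio() {
    dev="$1"
    devdir="$SYSFS_PCI_BUS/devices/$dev"
    [ -d "$devdir" ] || { echo "no such device: $dev" >&2; return 1; }
    # Tell the PCI core that only vfio-pci may bind to this device.
    echo vfio-pci > "$devdir/driver_override"
    # Detach the device from its current driver, if it has one.
    if [ -e "$devdir/driver" ]; then
        echo "$dev" > "$devdir/driver/unbind"
    fi
    # Ask the PCI core to re-probe the device; the override selects vfio-pci.
    echo "$dev" > "$SYSFS_PCI_BUS/drivers_probe"
}
```

As with the original script, vfio-pci has to be loaded first (modprobe vfio-pci), and the function would be called once per address in DEVICES.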
/etc/sysconfig/vfio-bind
DEVICES="0000:04:00.0 0000:05:00.0 0000:84:00.0 0000:85:00.0"
/etc/systemd/system/vfio-bind.service
[Unit]
Description=Binds devices to vfio-pci
After=syslog.target
[Service]
EnvironmentFile=-/etc/sysconfig/vfio-bind
Type=oneshot
RemainAfterExit=yes
ExecStart=-/usr/local/bin/vfio-bind $DEVICES
[Install]
WantedBy=multi-user.target
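Because the failure is intermittent, it may be worth confirming on the host, before each VM start, that every address in DEVICES really is bound to vfio-pci and which IOMMU group it belongs to. The following is a minimal sketch, assuming the standard sysfs layout; the function names pci_binding and check_bindings and the SYSFS_PCI override are hypothetical and not part of the setup described here.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original setup.
# SYSFS_PCI may be overridden for testing; it defaults to the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Print "<pci-address> <driver> <iommu-group>" for one PCI function.
# A missing driver or IOMMU group is reported as "none".
pci_binding() {
    dev="$1"
    driver="none"
    group="none"
    # Both entries are symlinks; the last path component is the name.
    if [ -L "$SYSFS_PCI/$dev/driver" ]; then
        driver=$(basename "$(readlink "$SYSFS_PCI/$dev/driver")")
    fi
    if [ -L "$SYSFS_PCI/$dev/iommu_group" ]; then
        group=$(basename "$(readlink "$SYSFS_PCI/$dev/iommu_group")")
    fi
    echo "$dev $driver $group"
}

# Report the binding of every PCI address passed as an argument.
check_bindings() {
    for dev in "$@"; do
        pci_binding "$dev"
    done
}
```

With the configuration above, this could be called as check_bindings 0000:04:00.0 0000:05:00.0 0000:84:00.0 0000:85:00.0; any line whose driver column is not vfio-pci would point at a host-side binding problem rather than a guest-side one.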
Stacktrace:
Feb 26 11:36:39 k2-test kernel: [ 1.923024] BUG: unable to handle kernel
NULL pointer dereference at (null)
Feb 26 11:36:39 k2-test kernel: [ 1.923031] IP: [<ffffffff817b669c>]
__down_common+0x45/0x10e
Feb 26 11:36:39 k2-test kernel: [ 1.923032] PGD 42a88c067 PUD 42a885067
PMD 0
Feb 26 11:36:39 k2-test kernel: [ 1.923034] Oops: 0002 [#1] SMP
Feb 26 11:36:39 k2-test kernel: [ 1.923042] Modules linked in:
crc32_pclmul(+) ghash_clmulni_intel(-) aesni_intel aes_x86_64 ppdev lrw
gf128mul glue_helper nvidia(POE+) ablk_helper cryptd serio_raw
8250_fintek parport_pc ttm drm_kms_helper drm syscopyarea sysfillrect
sysimgblt mac_hid i2c_piix4 lp parport nls_utf8 isofs floppy psmouse
pata_acpi
Feb 26 11:36:39 k2-test kernel: [ 1.923045] CPU: 2 PID: 560 Comm:
nvidia-persiste Tainted: P OE 3.19.0-51-generic #57~14.04.1-Ubuntu
Feb 26 11:36:39 k2-test kernel: [ 1.923046] Hardware name: QEMU Standard
PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150715_102347- 04/01/2014
Feb 26 11:36:39 k2-test kernel: [ 1.923047] task: ffff88042d0844b0 ti:
ffff88042b210000 task.ti: ffff88042b210000
Feb 26 11:36:39 k2-test kernel: [ 1.923049] RIP:
0010:[<ffffffff817b669c>] [<ffffffff817b669c>] __down_common+0x45/0x10e
Feb 26 11:36:39 k2-test kernel: [ 1.923050] RSP: 0018:ffff88042b213ad8
EFLAGS: 00010096
Feb 26 11:36:39 k2-test kernel: [ 1.923050] RAX: 0000000000000000 RBX:
ffffffffc1435540 RCX: ffffffffc1435548
Feb 26 11:36:39 k2-test kernel: [ 1.923051] RDX: ffff88042b213ae8 RSI:
0000000000000002 RDI: ffffffffc1435540
Feb 26 11:36:39 k2-test kernel: [ 1.923052] RBP: ffff88042b213b38 R08:
000000000001d850 R09: ffffffffc1163d4b
Feb 26 11:36:39 k2-test kernel: [ 1.923052] R10: 0000000000000020 R11:
00000000000000ff R12: 7fffffffffffffff
Feb 26 11:36:39 k2-test kernel: [ 1.923052] R13: ffff88042d0844b0 R14:
0000000000000002 R15: 0000000000000000
Feb 26 11:36:39 k2-test kernel: [ 1.923053] FS: 00007fa750b87740(0000)
GS:ffff88043fc80000(0000) knlGS:0000000000000000
Feb 26 11:36:39 k2-test kernel: [ 1.923054] CS: 0010 DS: 0000 ES: 0000
CR0: 0000000080050033
Feb 26 11:36:39 k2-test kernel: [ 1.923055] CR2: 0000000000000000 CR3:
000000042a886000 CR4: 00000000001407e0
Feb 26 11:36:39 k2-test kernel: [ 1.923058] Stack:
Feb 26 11:36:39 k2-test kernel: [ 1.923059] 0000000000000000
00000000000200da ffffffffc1435548 0000000000000000
Feb 26 11:36:39 k2-test kernel: [ 1.923061] 0000000000000000
00000000000000d0 00000000000000d0 ffffffffc1435540
Feb 26 11:36:39 k2-test kernel: [ 1.923062] ffff88042b388000
0000000000000003 ffff88042aa78a98 0000000000000002
Feb 26 11:36:39 k2-test kernel: [ 1.923062] Call Trace:
Feb 26 11:36:39 k2-test kernel: [ 1.923065] [<ffffffff817b6782>]
__down+0x1d/0x1f
Feb 26 11:36:39 k2-test kernel: [ 1.923070] [<ffffffff810bb971>]
down+0x41/0x50
Feb 26 11:36:39 k2-test kernel: [ 1.923142] [<ffffffffc1164087>]
nvidia_open+0x3c7/0x9c0 [nvidia]
Feb 26 11:36:39 k2-test kernel: [ 1.923176] [<ffffffffc1162ded>]
nvidia_frontend_open+0x4d/0xa0 [nvidia]
Feb 26 11:36:39 k2-test kernel: [ 1.923179] [<ffffffff811f117f>]
chrdev_open+0x9f/0x1d0
Feb 26 11:36:39 k2-test kernel: [ 1.923181] [<ffffffff811e9c37>]
do_dentry_open+0x1f7/0x340
Feb 26 11:36:39 k2-test kernel: [ 1.923182] [<ffffffff811f10e0>] ?
cdev_put+0x30/0x30
Feb 26 11:36:39 k2-test kernel: [ 1.923184] [<ffffffff811eb487>]
vfs_open+0x57/0x60
Feb 26 11:36:39 k2-test kernel: [ 1.923186] [<ffffffff811fb3dc>]
do_last+0x4ec/0x1190
Feb 26 11:36:39 k2-test kernel: [ 1.923188] [<ffffffff811fc100>]
path_openat+0x80/0x600
Feb 26 11:36:39 k2-test kernel: [ 1.923191] [<ffffffff810d629d>] ?
call_rcu_sched+0x1d/0x20
Feb 26 11:36:39 k2-test kernel: [ 1.923195] [<ffffffff81075ffa>] ?
release_task+0x38a/0x470
Feb 26 11:36:39 k2-test kernel: [ 1.923196] [<ffffffff811fd81a>]
do_filp_open+0x3a/0x90
Feb 26 11:36:39 k2-test kernel: [ 1.923199] [<ffffffff8120a407>] ?
__alloc_fd+0xa7/0x130
Feb 26 11:36:39 k2-test kernel: [ 1.923200] [<ffffffff811eb809>]
do_sys_open+0x129/0x280
Feb 26 11:36:39 k2-test kernel: [ 1.923202] [<ffffffff81075b80>] ?
task_stopped_code+0x60/0x60
Feb 26 11:36:39 k2-test kernel: [ 1.923203] [<ffffffff811eb97e>]
SyS_open+0x1e/0x20
Feb 26 11:36:39 k2-test kernel: [ 1.923206] [<ffffffff817b874d>]
system_call_fastpath+0x16/0x1b
Feb 26 11:36:39 k2-test kernel: [ 1.923216] Code: 55 65 4c 8b 2c 25 00
b9 00 00 41 54 49 89 d4 48 8d 55 b0 53 48 89 fb 48 83 ec 38 48 8b 47 10
48 89 4d b0 48 89 57 10 48 89 45 b8 <48> 89 10 48 89 f0 83 e0 01 4c 89
6d c0 c6 45 c8 00 48 89 45 a8
Feb 26 11:36:39 k2-test kernel: [ 1.923218] RIP [<ffffffff817b669c>]
__down_common+0x45/0x10e
Feb 26 11:36:39 k2-test kernel: [ 1.923218] RSP <ffff88042b213ad8>
Feb 26 11:36:39 k2-test kernel: [ 1.923219] CR2: 0000000000000000
The VM has the following relevant NVIDIA (CUDA) driver packages installed:
ii  cuda-nvrtc-7-5         7.5-18           amd64  NVRTC native runtime libraries
ii  cuda-nvrtc-dev-7-5     7.5-18           amd64  NVRTC native dev links, headers
ii  libxnvctrl0            352.79-0ubuntu1  amd64  NV-CONTROL X extension (runtime library)
ii  nvidia-352             352.79-0ubuntu1  amd64  NVIDIA binary driver - version 352.79
ii  nvidia-352-dev         352.79-0ubuntu1  amd64  NVIDIA binary Xorg driver development files
ii  nvidia-352-uvm         352.79-0ubuntu1  amd64  Transitional package for nvidia-352
ii  nvidia-modprobe        352.79-0ubuntu1  amd64  Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-352  352.79-0ubuntu1  amd64  NVIDIA OpenCL ICD
ii  nvidia-settings        352.79-0ubuntu1  amd64  Tool for configuring the NVIDIA graphics driver
VM kernel:
uname -a
Linux k2-test kernel 3.19.0-51-generic #57~14.04.1-Ubuntu SMP Fri Feb 19
14:36:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
XML VM:
<domain type='kvm' id='8'>
<name>one-221</name>
<uuid>0f415850-451e-465b-8ad4-bb6cd84209d2</uuid>
<metadata>
<system_datastore>/var/lib/one//datastores/109/221</system_datastore>
</metadata>
<memory unit='KiB'>16777216</memory>
<currentMemory unit='KiB'>16777216</currentMemory>
<vcpu placement='static'>8</vcpu>
<cputune>
<shares>8192</shares>
</cputune>
<resource>
<partition>/machine</partition>
</resource>
<os>
<type arch='x86_64' machine='pc-i440fx-2.4'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
</features>
<cpu mode='host-passthrough'/>
<clock offset='utc'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>destroy</on_crash>
<devices>
<emulator>/usr/bin/qemu-kvm</emulator>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2' cache='none'/>
<source file='/var/lib/one//datastores/109/221/disk.0'/>
<backingStore/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04'
function='0x0'/>
</disk>
<disk type='file' device='cdrom'>
<driver name='qemu' type='raw'/>
<source file='/var/lib/one//datastores/109/221/disk.1'/>
<backingStore/>
<target dev='hda' bus='ide'/>
<readonly/>
<alias name='ide0-0-0'/>
<address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
<controller type='usb' index='0'>
<alias name='usb'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01'
function='0x2'/>
</controller>
<controller type='pci' index='0' model='pci-root'>
<alias name='pci.0'/>
</controller>
<controller type='ide' index='0'>
<alias name='ide'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01'
function='0x1'/>
</controller>
<interface type='bridge'>
<mac address='04:09:92:65:3b:1d'/>
<source bridge='ovsbridge0'/>
<virtualport type='openvswitch'>
<parameters interfaceid='0576023f-b955-4d46-8129-bbcb5e26dfa2'/>
</virtualport>
<target dev='vnet1'/>
<model type='virtio'/>
<alias name='net0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03'
function='0x0'/>
</interface>
<input type='mouse' bus='ps2'/>
<input type='keyboard' bus='ps2'/>
<graphics type='vnc' port='6121' autoport='no' listen='0.0.0.0'>
<listen type='address' address='0.0.0.0'/>
</graphics>
<video>
<model type='cirrus' vram='16384' heads='1'/>
<alias name='video0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02'
function='0x0'/>
</video>
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0000' bus='0x84' slot='0x00' function='0x0'/>
</source>
<alias name='hostdev0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05'
function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0000' bus='0x85' slot='0x00' function='0x0'/>
</source>
<alias name='hostdev1'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06'
function='0x0'/>
</hostdev>
<memballoon model='virtio'>
<alias name='balloon0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07'
function='0x0'/>
</memballoon>
</devices>
</domain>
--
Kind regards,
Martijn Kint
Systeem Expert Big Data Services & HPC Cloud
e-mail: martijn.kint at surfsara.nl | M: +31 6 16 38 64 69
SURFsara | Science Park 140 | 1098 XG Amsterdam