[vfio-users] AMDGPU rebind kernel bug

Gary gary at mups.co.uk
Sat Aug 18 00:48:24 UTC 2018


Hi all,

I have vfio-pci configured so that the Linux host runs on the Intel iGPU
whilst an 8GB Sapphire Nitro+ RX580 is passed through to a Windows 10 VM
using virt-manager. As long as I eject the GPU in Windows before
shutting down the VM, everything works (AMD reset bug?).

I would however like to use the RX580 in the host when the VM is not
running. In order to do this I removed the vfio-pci ids= option allowing
the amdgpu module to bind as normal. I also updated my xorg config to:

  Section "Device"
      Identifier "Intel Graphics"
      Driver "intel"
      Option "DRI" "3"
  EndSection

  Section "ServerFlags"
      Option "AutoAddGPU" "off"
  EndSection

  Section "Device"
      Identifier "AMDGPU"
      Driver "amdgpu"
      Option "DRI3" "1"
      Option "Ignore" "1"
  EndSection

This allows me to use the Intel graphics or, via DRI_PRIME=1, the AMD
graphics. I can also start the VM, and virt-manager will rebind the
GPU and its audio function to vfio-pci; the VM then works nicely.
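
To confirm which driver owns the card at each stage (amdgpu on the host,
vfio-pci while the VM runs), the driver symlink in sysfs can be read. The
following is a minimal Python sketch under stated assumptions: the
bound_driver helper is hypothetical, 0000:01:00.0 is the GPU address that
appears in the dmesg output, and 0000:01:00.1 for the HDMI audio function
is an assumption.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys"):
    """Return the name of the driver bound to pci_addr, or None if unbound.

    sysfs exposes the bound driver as a symlink named "driver" inside the
    device directory; the link is absent when no driver is bound.
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # 0000:01:00.0 comes from the dmesg output; 0000:01:00.1 (audio) is assumed.
    for addr in ("0000:01:00.0", "0000:01:00.1"):
        print(addr, bound_driver(addr))
```

Running this before starting the VM, while it is running, and after
shutdown should show whether the rebind in each direction took place.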

The problem with this setup comes when I eject the GPU in Windows.
virt-manager on the host locks up and dmesg shows a kernel BUG message
(full trace at the end of this email):


  [  423.535829] ------------[ cut here ]------------
  [  423.535830] kernel BUG at /build/linux-hvYKKE/linux-4.17.8/drivers/iommu/intel-iommu.c:732!
  [  423.535835] invalid opcode: 0000 [#1] SMP PTI
  [  423.535836] Modules linked in: tun fuse ebtable_filter...
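
The full trace for this error shows libvirtd reaching amdgpu_pci_probe
through store_drivers_probe, i.e. a write to /sys/bus/pci/drivers_probe.
For anyone trying to reproduce this without libvirt, the sketch below
lists the sysfs writes that a manual reprobe would roughly involve. It is
an assumption about an equivalent sequence, not a description of what
libvirt actually does; rebind_plan and apply_plan are hypothetical
helpers, and applying the plan on an affected machine may well trigger
the same BUG.

```python
import os


def rebind_plan(pci_addr, currently_bound, sysfs_root="/sys"):
    """Return the ordered (path, value) sysfs writes for a manual reprobe."""
    dev = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr)
    steps = []
    if currently_bound:
        # Detach the device from whatever driver holds it (e.g. vfio-pci).
        steps.append((os.path.join(dev, "driver", "unbind"), pci_addr))
    # Writing an empty line clears any driver_override that was set.
    steps.append((os.path.join(dev, "driver_override"), "\n"))
    # Ask the driver core to probe the device against all loaded drivers.
    steps.append((os.path.join(sysfs_root, "bus", "pci", "drivers_probe"), pci_addr))
    return steps


def apply_plan(steps):
    """Perform the writes in order; needs root when pointed at the real /sys."""
    for path, value in steps:
        with open(path, "w") as f:
            f.write(value)


if __name__ == "__main__":
    # Dry run only: print the plan instead of applying it.
    for path, value in rebind_plan("0000:01:00.0", currently_bound=True):
        print(repr(value), "->", path)
```

Keeping the plan separate from apply_plan makes it possible to inspect
the sequence first and to run the steps one at a time while watching
dmesg.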


After a power cycle, and thinking this might be related to the amdgpu
module rebind, I tried unloading the amdgpu module whilst the VM was
running and the GPU was therefore bound to vfio-pci. Ejecting the GPU in
Windows no longer caused virt-manager to lock up, and I could then shut
down the VM via virt-manager.

However, this just delays the issue. When an attempt is made to rebind
the GPU to amdgpu, I once more get a lockup, this time with the dmesg error:

  [  982.416988] BUG: unable to handle kernel paging request at
ffffb9ad1281a2b4
  [  982.416992] PGD 41e921067 P4D 41e921067 PUD 0
  [  982.416995] Oops: 0002 [#1] SMP PTI
  [  982.416997] Modules linked in: amdgpu(+) chash gpu_sched...

Note that only the graphics output locks up. I can still SSH into the
machine, although an attempt to shut it down does not get very far.

Is this in any way related to the AMD reset bug? If not, is there a fix
or workaround, or any further information I could provide to help
troubleshoot this?
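
For comparing traces between attempts, the faulting function and the
reliable call-trace frames can be pulled out of saved dmesg output. This
is a rough sketch only: summarize_oops is a hypothetical helper, and it
assumes unwrapped dmesg lines (one log entry per line), unlike the
mail-wrapped copies in this message.

```python
import re

_TS = r"^\[\s*\d+\.\d+\]\s+"  # dmesg timestamp prefix, e.g. "[  423.535905] "
_RIP = re.compile(_TS + r"RIP: (?:0010:)?([\w.]+)\+0x")
_FRAME = re.compile(_TS + r"(\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")
_TRACE_START = re.compile(_TS + r"Call Trace:")
_ANY_RIP = re.compile(_TS + r"RIP:")


def summarize_oops(text):
    """Return (faulting_symbol, frames) from the first oops in text.

    Frames marked with "?" are unreliable stack entries and are skipped.
    """
    rip = None
    frames = []
    in_trace = False
    for line in text.splitlines():
        if rip is None:
            m = _RIP.match(line)
            if m:
                rip = m.group(1)
        if _TRACE_START.match(line):
            in_trace = True
            continue
        if in_trace:
            if _ANY_RIP.match(line):
                # The user-space RIP line marks the end of the kernel trace.
                break
            m = _FRAME.match(line)
            if m and not m.group(1):
                frames.append(m.group(2))
    return rip, frames


if __name__ == "__main__":
    import sys
    symbol, trace = summarize_oops(sys.stdin.read())
    print("faulting function:", symbol)
    for name in trace:
        print("  ", name)
```

Usage would be along the lines of piping dmesg into the script after
each failure and diffing the resulting summaries.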


Full trace from dmesg for the two errors follows

----------------------- FIRST Error ------------------------------
[  423.535829] ------------[ cut here ]------------
[  423.535830] kernel BUG at
/build/linux-hvYKKE/linux-4.17.8/drivers/iommu/intel-iommu.c:732!
[  423.535835] invalid opcode: 0000 [#1] SMP PTI
[  423.535836] Modules linked in: tun fuse ebtable_filter ebtables
bridge stp llc cpufreq_powersave cpufreq_userspace cpufreq_conservative
binfmt_misc nls_ascii nls_cp437 vfat fat snd_hda_codec_realtek
snd_hda_codec_generic amdkfd ip6t_REJECT nf_reject_ipv6 nf_log_ipv6
xt_hl ip6t_rt amdgpu snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support
intel_rapl nf_conntrack_ipv6 nf_defrag_ipv6 x86_pkg_temp_thermal
intel_powerclamp snd_hda_intel coretemp chash snd_hda_codec gpu_sched
snd_hda_core kvm_intel i915 kvm ttm snd_hwdep efi_pstore intel_cstate
snd_pcm intel_uncore intel_rapl_perf ipt_REJECT nf_reject_ipv4 serio_raw
snd_timer pcspkr efivars drm_kms_helper nf_log_ipv4 sg snd drm joydev
evdev mei_me lpc_ich i2c_algo_bit soundcore mei shpchp ie31200_edac
nf_log_common xt_LOG video button xt_limit xt_tcpudp xt_addrtype
[  423.535866]  nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
ip6table_filter ip6_tables nf_conntrack_netbios_ns
nf_conntrack_broadcast nf_nat_ftp nf_nat vfio_pci vfio_virqfd
vfio_iommu_type1 nf_conntrack_ftp vfio irqbypass nf_conntrack parport_pc
ppdev lp iptable_filter parport sunrpc efivarfs ip_tables x_tables
autofs4 ext4 crc16 mbcache jbd2 fscrypto ecb btrfs zstd_decompress
zstd_compress xxhash algif_skcipher af_alg dm_crypt raid10 raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq
libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod dm_mod
sd_mod hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd
glue_helper psmouse ahci i2c_i801 libahci xhci_pci ehci_pci libata
xhci_hcd ehci_hcd
[  423.535894]  alx scsi_mod mdio thermal usbcore usb_common fan
[  423.535899] CPU: 2 PID: 3815 Comm: libvirtd Not tainted
4.17.0-0.bpo.1-amd64 #1 Debian 4.17.8-1~bpo9+1
[  423.535900] Hardware name: Gigabyte Technology Co., Ltd. To be filled
by O.E.M./B75-D3V, BIOS F9 10/23/2013
[  423.535905] RIP: 0010:domain_get_iommu+0x4e/0x60
[  423.535906] RSP: 0018:ffffa52d48a4bb48 EFLAGS: 00010202
[  423.535907] RAX: 0000000000000001 RBX: 0000000080c27000 RCX:
0000000000000000
[  423.535908] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff8b4a595d4d00
[  423.535909] RBP: 0000000000000000 R08: 00000000000272d0 R09:
ffffffff994ef4b7
[  423.535910] R10: ffffa52d48a4ba60 R11: ffffe0d58fd21f20 R12:
ffff8b4a5c5fb0a0
[  423.535911] R13: 000000ffffffffff R14: ffff8b4a595d4d00 R15:
0000000000001000
[  423.535913] FS:  00007f287deb2700(0000) GS:ffff8b4a6e300000(0000)
knlGS:0000000000000000
[  423.535914] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  423.535915] CR2: fffff80077770000 CR3: 000000041772c003 CR4:
00000000001626e0
[  423.535916] Call Trace:
[  423.535920]  __intel_map_single+0x61/0x180
[  423.535957]  amdgpu_gart_init+0x5e/0x100 [amdgpu]
[  423.535983]  gmc_v8_0_sw_init+0x669/0x700 [amdgpu]
[  423.535997]  ? drm_detect_hdmi_monitor+0x3e/0xe0 [drm]
[  423.536017]  amdgpu_device_init+0x102a/0x1490 [amdgpu]
[  423.536019]  ? kmalloc_order+0x14/0x40
[  423.536039]  amdgpu_driver_load_kms+0x86/0x2c0 [amdgpu]
[  423.536046]  drm_dev_register+0x132/0x1c0 [drm]
[  423.536066]  amdgpu_pci_probe+0x1b5/0x280 [amdgpu]
[  423.536069]  local_pci_probe+0x44/0xa0
[  423.536072]  ? _cond_resched+0x16/0x40
[  423.536074]  pci_device_probe+0x102/0x1b0
[  423.536077]  driver_probe_device+0x2b2/0x490
[  423.536079]  ? __driver_attach+0xe0/0xe0
[  423.536080]  bus_for_each_drv+0x64/0xb0
[  423.536082]  __device_attach+0xd9/0x150
[  423.536084]  bus_rescan_devices_helper+0x30/0x50
[  423.536086]  store_drivers_probe+0x2d/0x60
[  423.536088]  kernfs_fop_write+0x10f/0x190
[  423.536091]  vfs_write+0xb0/0x190
[  423.536093]  ksys_write+0x52/0xc0
[  423.536095]  do_syscall_64+0x55/0x110
[  423.536097]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  423.536098] RIP: 0033:0x7f28a4c4b1ad
[  423.536099] RSP: 002b:00007f287deb1930 EFLAGS: 00000293 ORIG_RAX:
0000000000000001
[  423.536101] RAX: ffffffffffffffda RBX: 0000000000000016 RCX:
00007f28a4c4b1ad
[  423.536102] RDX: 000000000000000c RSI: 00007f2858008d24 RDI:
0000000000000016
[  423.536103] RBP: 000000000000000c R08: 00007f28540009e0 R09:
0000000000000000
[  423.536104] R10: 00007f28a84ce903 R11: 0000000000000293 R12:
00007f2858008d24
[  423.536105] R13: 0000000000000000 R14: 0000000000000016 R15:
00007f2854000a00
[  423.536106] Code: 74 0d eb 29 48 83 c7 04 8b 4f fc 85 c9 75 0a 83 c0
01 39 d0 75 ee 31 c0 c3 48 98 48 c1 e0 03 48 8b 15 a7 4e 14 01 48 8b 04
02 c3 <0f> 0b 31 c0 eb ee 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
[  423.536126] RIP: domain_get_iommu+0x4e/0x60 RSP: ffffa52d48a4bb48
[  423.536128] ---[ end trace 68f635a30860d3cb ]---




----------------------- SECOND Error ------------------------------

[  981.069606] [drm] amdgpu kernel modesetting enabled.
[  981.069826] [drm] initializing kernel modesetting (POLARIS10
0x1002:0x67DF 0x1DA2:0xE366 0xE7).
[  981.069845] [drm] register mmio base: 0xF7D00000
[  981.069845] [drm] register mmio size: 262144
[  981.069851] [drm] probing gen 2 caps for device 8086:151 = 261ad03/e
[  981.069852] [drm] probing mlw for device 8086:151 = 261ad03
[  981.069853] [drm] add ip block number 0 <vi_common>
[  981.069854] [drm] add ip block number 1 <gmc_v8_0>
[  981.069855] [drm] add ip block number 2 <tonga_ih>
[  981.069855] [drm] add ip block number 3 <powerplay>
[  981.069856] [drm] add ip block number 4 <dm>
[  981.069856] [drm] add ip block number 5 <gfx_v8_0>
[  981.069857] [drm] add ip block number 6 <sdma_v3_0>
[  981.069857] [drm] add ip block number 7 <uvd_v6_0>
[  981.069858] [drm] add ip block number 8 <vce_v3_0>
[  981.069861] kfd kfd: skipped device 1002:67df, PCI rejects atomics
[  981.069868] [drm] UVD is enabled in VM mode
[  981.069868] [drm] UVD ENC is enabled in VM mode
[  981.069869] [drm] VCE enabled in VM mode
[  982.413309] ATOM BIOS: 113-BE366EU-Z48
[  982.413358] [drm] vm size is 64 GB, 2 levels, block size is 10-bit,
fragment size is 9-bit
[  982.413429] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_mc.bin
[  982.413437] amdgpu 0000:01:00.0: VRAM: 8192M 0x000000F400000000 -
0x000000F5FFFFFFFF (8192M used)
[  982.413438] amdgpu 0000:01:00.0: GTT: 256M 0x0000000000000000 -
0x000000000FFFFFFF
[  982.413446] [drm] Detected VRAM RAM=8192M, BAR=256M
[  982.413447] [drm] RAM width 256bits GDDR5
[  982.413562] [TTM] Zone  kernel: Available graphics memory: 7701472 kiB
[  982.413563] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[  982.413564] [TTM] Initializing pool allocator
[  982.413568] [TTM] Initializing DMA pool allocator
[  982.413858] [drm] amdgpu: 8192M of VRAM memory ready
[  982.413859] [drm] amdgpu: 8192M of GTT memory ready.
[  982.413876] DMAR: 64bit 0000:01:00.0 uses identity mapping
[  982.413877] [drm] GART: num cpu pages 65536, num gpu pages 65536
[  982.413910] [drm] PCIE GART of 256M enabled (table at
0x000000F400040000).
[  982.414019] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_pfp_2.bin
[  982.414033] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_me_2.bin
[  982.414046] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_ce_2.bin
[  982.414046] [drm] Chained IB support enabled!
[  982.414058] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_rlc.bin
[  982.414138] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_mec_2.bin
[  982.414240] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_mec2_2.bin
[  982.415203] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_sdma.bin
[  982.415220] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_sdma1.bin
[  982.415397] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_uvd.bin
[  982.415400] [drm] Found UVD firmware Version: 1.130 Family ID: 16
[  982.416620] amdgpu 0000:01:00.0: firmware: direct-loading firmware
amdgpu/polaris10_vce.bin
[  982.416624] [drm] Found VCE firmware Version: 53.26 Binary ID: 3
[  982.416988] BUG: unable to handle kernel paging request at
ffffb9ad1281a2b4
[  982.416992] PGD 41e921067 P4D 41e921067 PUD 0
[  982.416995] Oops: 0002 [#1] SMP PTI
[  982.416997] Modules linked in: amdgpu(+) chash gpu_sched ttm tun fuse
ebtable_filter ebtables bridge stp llc cpufreq_powersave
cpufreq_userspace cpufreq_conservative binfmt_misc intel_rapl
x86_pkg_temp_thermal intel_powerclamp nls_ascii nls_cp437 vfat fat
coretemp iTCO_wdt iTCO_vendor_support kvm_intel ip6t_REJECT
nf_reject_ipv6 snd_hda_codec_realtek nf_log_ipv6 kvm amdkfd intel_cstate
snd_hda_codec_generic efi_pstore intel_uncore xt_hl intel_rapl_perf
ip6t_rt i915 efivars serio_raw pcspkr snd_hda_codec_hdmi snd_hda_intel
snd_hda_codec drm_kms_helper snd_hda_core snd_hwdep snd_pcm drm
snd_timer joydev mei_me nf_conntrack_ipv6 evdev snd sg lpc_ich soundcore
mei shpchp i2c_algo_bit ie31200_edac nf_defrag_ipv6 video button
ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit
xt_tcpudp
[  982.417033]  xt_addrtype nf_conntrack_ipv4 nf_defrag_ipv4
xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns
nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack
iptable_filter vfio_pci vfio_virqfd vfio_iommu_type1 vfio irqbypass
sunrpc parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4
ext4 crc16 mbcache jbd2 fscrypto ecb btrfs zstd_decompress zstd_compress
xxhash algif_skcipher af_alg dm_crypt raid10 raid456 async_raid6_recov
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c
crc32c_generic raid1 raid0 multipath linear md_mod dm_mod sd_mod
hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel pcbc ahci aesni_intel aes_x86_64 crypto_simd libahci
cryptd psmouse glue_helper i2c_i801 xhci_pci libata ehci_pci xhci_hcd
[  982.417071]  ehci_hcd scsi_mod alx mdio usbcore usb_common fan
thermal [last unloaded: chash]
[  982.417078] CPU: 2 PID: 3332 Comm: modprobe Not tainted
4.17.0-0.bpo.1-amd64 #1 Debian 4.17.8-1~bpo9+1
[  982.417080] Hardware name: Gigabyte Technology Co., Ltd. To be filled
by O.E.M./B75-D3V, BIOS F9 10/23/2013
[  982.417142] RIP:
0010:smu7_populate_single_firmware_entry.isra.5+0x89/0xe0 [amdgpu]
[  982.417143] RSP: 0018:ffffb991420d7950 EFLAGS: 00010246
[  982.417145] RAX: 000000000000008c RBX: 0000000000000003 RCX:
0000000000000000
[  982.417147] RDX: ffffffffc0f68a64 RSI: 0000000000000004 RDI:
ffff8cafdb9c4360
[  982.417148] RBP: ffffb9ad1281a2b4 R08: 0000000000000002 R09:
ffffb991493be000
[  982.417149] R10: 00000000802a0001 R11: 0000000000000001 R12:
ffff8cafd698d040
[  982.417151] R13: ffff8cafa26fe000 R14: 000000000000047e R15:
0000000000000003
[  982.417154] FS:  00007fb5f5737700(0000) GS:ffff8cafee300000(0000)
knlGS:0000000000000000
[  982.417155] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  982.417157] CR2: ffffb9ad1281a2b4 CR3: 00000003ee264003 CR4:
00000000001606e0
[  982.417158] Call Trace:
[  982.417208]  smu7_request_smu_load_fw+0x97/0x320 [amdgpu]
[  982.417252]  polaris10_start_smu+0x64/0x4c0 [amdgpu]
[  982.417293]  ? amdgpu_ucode_init_bo+0xe2/0x270 [amdgpu]
[  982.417341]  pp_hw_init+0x4c/0xd0 [amdgpu]
[  982.417378]  amdgpu_device_init+0x13c3/0x1490 [amdgpu]
[  982.417383]  ? kmalloc_order+0x14/0x40
[  982.417419]  amdgpu_driver_load_kms+0x86/0x2c0 [amdgpu]
[  982.417433]  drm_dev_register+0x132/0x1c0 [drm]
[  982.417469]  amdgpu_pci_probe+0x1b5/0x280 [amdgpu]
[  982.417474]  local_pci_probe+0x44/0xa0
[  982.417478]  ? _cond_resched+0x16/0x40
[  982.417481]  pci_device_probe+0x102/0x1b0
[  982.417484]  driver_probe_device+0x2b2/0x490
[  982.417486]  __driver_attach+0xdd/0xe0
[  982.417489]  ? driver_probe_device+0x490/0x490
[  982.417491]  bus_for_each_dev+0x67/0xc0
[  982.417494]  ? klist_add_tail+0x3b/0x70
[  982.417496]  bus_add_driver+0x16a/0x260
[  982.417499]  driver_register+0x57/0xc0
[  982.417501]  ? 0xffffffffc1199000
[  982.417503]  do_one_initcall+0x4d/0x1c5
[  982.417506]  ? _cond_resched+0x16/0x40
[  982.417509]  ? kmem_cache_alloc_trace+0x15d/0x1c0
[  982.417512]  ? do_init_module+0x22/0x218
[  982.417515]  do_init_module+0x5b/0x218
[  982.417518]  load_module.constprop.55+0x2548/0x2d50
[  982.417521]  ? vfs_read+0x119/0x130
[  982.417524]  ? __do_sys_finit_module+0xd2/0x100
[  982.417526]  __do_sys_finit_module+0xd2/0x100
[  982.417530]  do_syscall_64+0x55/0x110
[  982.417532]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  982.417535] RIP: 0033:0x7fb5f52ac229
[  982.417536] RSP: 002b:00007ffe1335d988 EFLAGS: 00000246 ORIG_RAX:
0000000000000139
[  982.417538] RAX: ffffffffffffffda RBX: 00005596844ee4c0 RCX:
00007fb5f52ac229
[  982.417540] RDX: 0000000000000000 RSI: 0000559683708638 RDI:
0000000000000006
[  982.417541] RBP: 0000559683708638 R08: 0000000000000000 R09:
0000000000000000
[  982.417542] R10: 0000000000000006 R11: 0000000000000246 R12:
0000000000000000
[  982.417544] R13: 00005596844ef830 R14: 0000000000040000 R15:
0000000000000000
[  982.417545] Code: c0 83 e3 fb 0f 94 c0 66 89 45 18 31 c0 48 8b 4c 24
30 65 48 33 0c 25 28 00 00 00 75 5c 48 83 c4 38 5b 5d 41 5c c3 0f b7 44
24 02 <66> 89 5d 00 c7 45 0c 00 00 00 00 c7 45 10 00 00 00 00 66 89 45
[  982.417614] RIP: smu7_populate_single_firmware_entry.isra.5+0x89/0xe0
[amdgpu] RSP: ffffb991420d7950
[  982.417615] CR2: ffffb9ad1281a2b4
[  982.417617] ---[ end trace 095f6331aad830c9 ]---


Thanks,

Gary



