[vfio-users] NVIDIA error: Failed to initialize DMA. Failed to allocate push buffer
Bronek Kozicki
brok at incorrekt.com
Thu Dec 30 10:48:00 UTC 2021
I think this is the strongest hint so far: the host dmesg is flooded with "BAR 1: can't reserve" messages:
2021-12-30T10:01:59.456992+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: vfio_ecap_init: hiding ecap 0x1e at 0x258
2021-12-30T10:01:59.457413+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: vfio_ecap_init: hiding ecap 0x19 at 0x900
2021-12-30T10:01:59.457675+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: BAR 1: can't reserve [mem 0x20000000000-0x2000fffffff 64bit pref]
2021-12-30T10:01:59.486586+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.1: enabling device (0000 -> 0002)
2021-12-30T10:01:59.546592+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.3: enabling device (0000 -> 0002)
. . .
2021-12-30T10:09:58.164738+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: BAR 1: can't reserve [mem 0x20000000000-0x2000fffffff 64bit pref]
2021-12-30T10:09:58.164811+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: BAR 1: can't reserve [mem 0x20000000000-0x2000fffffff 64bit pref]
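
To see how often the message repeats, and to check that the range is the card's 256M prefetchable BAR, a small Python sketch like the one below could help. It is only an illustration: the regular expression is modelled on the lines quoted above and nothing else, and the function name summarize_reserve_failures is made up for this example.

```python
import re
import sys
from collections import Counter

# Matches kernel lines such as:
#   vfio-pci 0000:81:00.0: BAR 1: can't reserve [mem 0x20000000000-0x2000fffffff 64bit pref]
# Assumption: the format is inferred only from the log lines in this message.
PATTERN = re.compile(
    r"vfio-pci (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): can't reserve "
    r"\[mem (?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)"
)


def summarize_reserve_failures(lines):
    """Count "can't reserve" messages per (device, BAR, start, end, size in MiB)."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue  # unrelated kernel message
        start = int(m["start"], 16)
        end = int(m["end"], 16)
        # The range is inclusive, hence the +1.
        size_mib = (end - start + 1) // (1024 * 1024)
        counts[(m["dev"], int(m["bar"]), start, end, size_mib)] += 1
    return counts


if __name__ == "__main__":
    # Reads kernel log text from stdin and prints one summary line per range.
    for key, n in summarize_reserve_failures(sys.stdin).items():
        dev, bar, start, end, size_mib = key
        print(f"{dev} BAR {bar}: {start:#x}-{end:#x} ({size_mib} MiB) x{n}")
```

It can be fed from the host kernel log, for example `sudo dmesg | python3 bar_summary.py` (the script name is arbitrary). For the range above the computed size is 256 MiB, which matches the second memory region lspci reports for 81:00.0.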
I will try adjusting host BIOS options.
B.
On Thu, 30 Dec 2021, at 10:25 AM, Bronek Kozicki wrote:
> Some more information:
>
>
> 1. driver seems to be loading fine in guest
> bronekk at euclid:~$ sudo dmesg | grep -E "nvidia|0d:00"
> [ 0.810066] pci 0000:0d:00.0: [10de:1eb1] type 00 class 0x030000
> [ 0.814518] pci 0000:0d:00.0: reg 0x10: [mem 0xc0000000-0xc0ffffff]
> [ 0.818518] pci 0000:0d:00.0: reg 0x14: [mem 0x1000000000-0x100fffffff 64bit pref]
> [ 0.825110] pci 0000:0d:00.0: reg 0x1c: [mem 0x1010000000-0x1011ffffff 64bit pref]
> [ 0.829048] pci 0000:0d:00.0: reg 0x24: [io 0x9000-0x907f]
> [ 0.834899] pci 0000:0d:00.0: PME# supported from D0 D3hot D3cold
> [ 0.836042] pci 0000:0d:00.1: [10de:10f8] type 00 class 0x040300
> [ 0.837841] pci 0000:0d:00.1: reg 0x10: [mem 0xc1000000-0xc1003fff]
> [ 0.845020] pci 0000:0d:00.2: [10de:1ad8] type 00 class 0x0c0330
> [ 0.847351] pci 0000:0d:00.2: reg 0x10: [mem 0x1012000000-0x101203ffff 64bit pref]
> [ 0.854518] pci 0000:0d:00.2: reg 0x1c: [mem 0x1012040000-0x101204ffff 64bit pref]
> [ 0.858820] pci 0000:0d:00.2: PME# supported from D0 D3hot D3cold
> [ 0.862836] pci 0000:0d:00.3: [10de:1ad9] type 00 class 0x0c8000
> [ 0.864838] pci 0000:0d:00.3: reg 0x10: [mem 0xc1004000-0xc1004fff]
> [ 0.873964] pci 0000:0d:00.3: PME# supported from D0 D3hot D3cold
> [ 0.932598] pci 0000:0d:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
> [ 0.934523] pci 0000:0d:00.0: vgaarb: bridge control possible
> [ 0.936134] pci 0000:0d:00.0: vgaarb: setting as boot device (VGA legacy resources not available)
> [ 1.440190] pci 0000:0d:00.1: D0 power state depends on 0000:0d:00.0
> [ 1.441170] pci 0000:0d:00.2: D0 power state depends on 0000:0d:00.0
> [ 1.443582] pci 0000:0d:00.3: D0 power state depends on 0000:0d:00.0
> [ 2.619525] xhci_hcd 0000:0d:00.2: xHCI Host Controller
> [ 2.620624] xhci_hcd 0000:0d:00.2: new USB bus registered, assigned
> bus number 11
> [ 2.622792] xhci_hcd 0000:0d:00.2: hcc params 0x0180ff05 hci version
> 0x110 quirks 0x0000000000000010
> [ 2.672211] usb usb11: SerialNumber: 0000:0d:00.2
> [ 2.676422] xhci_hcd 0000:0d:00.2: xHCI Host Controller
> [ 2.677944] xhci_hcd 0000:0d:00.2: new USB bus registered, assigned
> bus number 12
> [ 2.681209] xhci_hcd 0000:0d:00.2: Host supports USB 3.1 Enhanced
> SuperSpeed
> [ 2.705956] usb usb12: SerialNumber: 0000:0d:00.2
> [ 3.926249] nvidia: loading out-of-tree module taints kernel.
> [ 3.927118] nvidia: module license 'NVIDIA' taints kernel.
> [ 3.938804] nvidia: module verification failed: signature and/or
> required key missing - tainting kernel
> [ 3.966693] nvidia-nvlink: Nvlink Core is being initialized, major
> device number 249
> [ 3.971181] nvidia 0000:0d:00.0: vgaarb: changed VGA decodes:
> olddecodes=io+mem,decodes=none:owns=none
> [ 4.070078] nvidia-modeset: Loading NVIDIA Kernel Mode Setting
> Driver for UNIX platforms 460.91.03 Fri Jul 2 05:43:38 UTC 2021
> [ 4.349705] [drm] [nvidia-drm] [GPU ID 0x00000d00] Loading driver
> [ 4.352647] [drm] Initialized nvidia-drm 0.0.0 20160202 for
> 0000:0d:00.0 on minor 0
> [ 4.527067] audit: type=1400 audit(1640858541.112:5):
> apparmor="STATUS" operation="profile_load" profile="unconfined"
> name="nvidia_modprobe" pid=650 comm="apparmor_parser"
> [ 4.527073] audit: type=1400 audit(1640858541.112:6):
> apparmor="STATUS" operation="profile_load" profile="unconfined"
> name="nvidia_modprobe//kmod" pid=650 comm="apparmor_parser"
> [ 4.915963] snd_hda_intel 0000:0d:00.1: Disabling MSI
> [ 4.954737] snd_hda_intel 0000:0d:00.1: Handle vga_switcheroo audio
> client
> [ 5.244486] input: HDA NVidia HDMI/DP,pcm=3 as
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input6
> [ 5.247732] input: HDA NVidia HDMI/DP,pcm=7 as
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input7
> [ 5.250636] input: HDA NVidia HDMI/DP,pcm=8 as
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input8
> [ 5.253520] input: HDA NVidia HDMI/DP,pcm=9 as
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input9
> [ 5.256445] input: HDA NVidia HDMI/DP,pcm=10 as
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input10
> [ 5.259401] input: HDA NVidia HDMI/DP,pcm=11 as
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input11
> [ 5.262271] input: HDA NVidia HDMI/DP,pcm=12 as
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input12
>
>
> bronekk at euclid:~$ sudo nvidia-smi
> Thu Dec 30 10:04:48 2021
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  Quadro RTX 4000     On   | 00000000:0D:00.0 Off |                  N/A |
> | 30%   39C    P8     3W / 125W |      1MiB /  7982MiB |      0%      Default |
> |                               |                      |                  N/A |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                                  |
> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
> |        ID   ID                                                   Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+
>
> bronekk at euclid:~$ sudo lspci -vnn -s 0d:00.0
> 0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL
> [Quadro RTX 4000] [10de:1eb1] (rev a1) (prog-if 00 [VGA controller])
> Subsystem: Dell TU104GL [Quadro RTX 4000] [1028:12a0]
> Physical Slot: 0-12
> Flags: bus master, fast devsel, latency 0, IRQ 116
> Memory at c0000000 (32-bit, non-prefetchable) [size=16M]
> Memory at 1000000000 (64-bit, prefetchable) [size=256M]
> Memory at 1010000000 (64-bit, prefetchable) [size=32M]
> I/O ports at 9000 [size=128]
> Capabilities: [60] Power Management version 3
> Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Capabilities: [78] Express Legacy Endpoint, MSI 00
> Capabilities: [100] Virtual Channel
> Capabilities: [250] Latency Tolerance Reporting
> Capabilities: [128] Power Budgeting <?>
> Capabilities: [420] Advanced Error Reporting
> Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1
> Len=024 <?>
> Kernel driver in use: nvidia
> Kernel modules: nvidia
>
>
> 2. host should not be trying to access the card:
>
> bronekk at gauss ~ % sudo lspci -vnn -s 81:00.0
> 81:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL
> [Quadro RTX 4000] [10de:1eb1] (rev a1) (prog-if 00 [VGA controller])
> Subsystem: Dell Device [1028:12a0]
> Flags: bus master, fast devsel, latency 0, IRQ 381, IOMMU group
> 31
> Memory at bc000000 (32-bit, non-prefetchable) [size=16M]
> Memory at 20000000000 (64-bit, prefetchable) [size=256M]
> Memory at 20010000000 (64-bit, prefetchable) [size=32M]
> I/O ports at b000 [size=128]
> Expansion ROM at bd000000 [disabled] [size=512K]
> Capabilities: [60] Power Management version 3
> Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Capabilities: [78] Express Legacy Endpoint, MSI 00
> Capabilities: [100] Virtual Channel
> Capabilities: [258] L1 PM Substates
> Capabilities: [128] Power Budgeting <?>
> Capabilities: [420] Advanced Error Reporting
> Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1
> Len=024 <?>
> Capabilities: [900] Secondary PCI Express
> Capabilities: [bb0] Physical Resizable BAR
> Kernel driver in use: vfio-pci
> Kernel modules: nouveau
>
>
> bronekk at gauss ~ % sudo cat /etc/modprobe.d/40-blacklist.conf
> # This host is headless, prevent any modules from attaching to video hardware
>
> # NVIDIA
> blacklist nouveau
> blacklist nvidia
>
> # AMD
> blacklist radeon
> blacklist amdgpu
> blacklist amdkfd
> blacklist fglrx
>
> # HDMI sound on a GPU
> blacklist snd_hda_intel
>
> # Framebuffers (ALL of them)
> blacklist vesafb
> blacklist aty128fb
> blacklist atyfb
> blacklist radeonfb
> blacklist cirrusfb
> blacklist cyber2000fb
> blacklist cyblafb
> blacklist gx1fb
> blacklist hgafb
> blacklist i810fb
> blacklist intelfb
> blacklist kyrofb
> blacklist lxfb
> blacklist matroxfb_base
> blacklist neofb
> blacklist nvidiafb
> blacklist pm2fb
> blacklist rivafb
> blacklist s1d13xxxfb
> blacklist savagefb
> blacklist sisfb
> blacklist sstfb
> blacklist tdfxfb
> blacklist tridentfb
> blacklist vfb
> blacklist viafb
> blacklist vt8623fb
> blacklist udlfb
>
> bronekk at gauss ~ % sudo cat /etc/modprobe.d/30-vfio.conf
> # 10de:* are NVIDIA
> # 1912:0015 is Renesas Technology Corp. uPD720202 USB 3.0 Host Controller
> options vfio-pci ids=10de:1eb1,10de:10f8,10de:1ad8,10de:1ad9,1912:0015
> options vfio-pci disable_vga=1
>
> bronekk at gauss ~ % sudo lspci -nn | grep -F "10de:"
> 81:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL
> [Quadro RTX 4000] [10de:1eb1] (rev a1)
> 81:00.1 Audio device [0403]: NVIDIA Corporation TU104 HD Audio
> Controller [10de:10f8] (rev a1)
> 81:00.2 USB controller [0c03]: NVIDIA Corporation TU104 USB 3.1 Host
> Controller [10de:1ad8] (rev a1)
> 81:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB
> Type-C UCSI Controller [10de:1ad9] (rev a1)
>
> 3. device mapping in libvirt:
>
> <hostdev mode='subsystem' type='pci' managed='yes'>
> <driver name='vfio'/>
> <source>
> <address domain='0x0000' bus='0x81' slot='0x00' function='0x0'/>
> </source>
> <rom bar='off'/>
> <address type='pci' domain='0x0000' bus='0x0d' slot='0x00'
> function='0x0' multifunction='on'/>
> </hostdev>
> <hostdev mode='subsystem' type='pci' managed='yes'>
> <driver name='vfio'/>
> <source>
> <address domain='0x0000' bus='0x81' slot='0x00' function='0x1'/>
> </source>
> <rom bar='off'/>
> <address type='pci' domain='0x0000' bus='0x0d' slot='0x00'
> function='0x1'/>
> </hostdev>
> <hostdev mode='subsystem' type='pci' managed='yes'>
> <driver name='vfio'/>
> <source>
> <address domain='0x0000' bus='0x81' slot='0x00' function='0x2'/>
> </source>
> <rom bar='off'/>
> <address type='pci' domain='0x0000' bus='0x0d' slot='0x00'
> function='0x2'/>
> </hostdev>
> <hostdev mode='subsystem' type='pci' managed='yes'>
> <driver name='vfio'/>
> <source>
> <address domain='0x0000' bus='0x81' slot='0x00' function='0x3'/>
> </source>
> <rom bar='off'/>
> <address type='pci' domain='0x0000' bus='0x0d' slot='0x00'
> function='0x3'/>
> </hostdev>
>
>
> 4. something is definitely wrong inside the guest, since I am getting these:
>
> [ 1236.179163] watchdog: BUG: soft lockup - CPU#12 stuck for 23s!
> [Xorg:2982]
> [ 1236.179961] Modules linked in: hid_generic usbhid hid rfkill
> snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg soundwire_intel
> soundwire_generic_allocation snd_soc_core ghash_clmulni_intel
> snd_compress soundwire_cadence nls_ascii snd_hda_codec nls_cp437 vfat
> fat aesni_intel snd_hda_core libaes snd_hwdep crypto_simd soundwire_bus
> cryptd nvidia_drm(POE) snd_pcm glue_helper snd_timer drm_kms_helper snd
> iTCO_wdt intel_pmc_bxt joydev iTCO_vendor_support sg serio_raw cec
> watchdog soundcore virtio_console virtio_balloon pcspkr evdev
> efi_pstore qemu_fw_cfg nvidia_modeset(POE) nvidia(POE) drm fuse
> configfs efivarfs virtio_rng rng_core ip_tables x_tables autofs4 ext4
> crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi sr_mod crc_t10dif cdrom
> crct10dif_generic ahci libahci xhci_pci libata xhci_hcd virtio_scsi
> virtio_net net_failover failover scsi_mod usbcore crct10dif_pclmul
> psmouse crct10dif_common crc32_pclmul crc32c_intel i2c_i801 virtio_pci
> lpc_ich i2c_smbus virtio_ring usb_common virtio button
> [ 1236.189681] CPU: 12 PID: 2982 Comm: Xorg Tainted: P OEL
> 5.10.0-10-amd64 #1 Debian 5.10.84-1
> [ 1236.190725] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 0.0.0 02/06/2015
> [ 1236.191711] RIP: 0010:_nv032887rm+0x12/0x40 [nvidia]
> [ 1236.192286] Code: d2 0e 31 c0 e8 af 7d 78 ff e8 ca 3c eb ff 31 c0 48
> 83 c4 08 c3 0f 1f 00 48 83 ec 08 39 4a 10 76 17 48 8b 02 c1 e9 02 8b 04
> 88 <48> 83 c4 08 c3 66 0f 1f 84 00 00 00 00 00 be 00 00 d5 09 bf 0a ad
> [ 1236.194379] RSP: 0018:ffffa9b840f6ba98 EFLAGS: 00000256
> [ 1236.194977] RAX: 00000000164000a1 RBX: 0000000000000020 RCX:
> 0000000000000000
> [ 1236.195804] RDX: ffff9995889fd0a0 RSI: ffff9995889fc008 RDI:
> ffff99958b67d008
> [ 1236.196617] RBP: ffff999586b02a00 R08: 0000000000000020 R09:
> 0000000000000000
> [ 1236.197425] R10: ffff9995889fc008 R11: ffff9995889fd0a0 R12:
> 0000000000000000
> [ 1236.198217] R13: 0000000000000000 R14: 0000000000000000 R15:
> ffff9995889fc008
> [ 1236.199011] FS: 00007f7bcbbd6a40(0000) GS:ffff999cdfb00000(0000)
> knlGS:0000000000000000
> [ 1236.199931] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1236.200587] CR2: 0000564dd482a3a8 CR3: 0000000102d80005 CR4:
> 0000000000770ee0
> [ 1236.201392] PKRU: 55555554
> [ 1236.201699] Call Trace:
> [ 1236.202118] ? _nv009235rm+0x1f1/0x230 [nvidia]
> [ 1236.202763] ? _nv036126rm+0x62/0x70 [nvidia]
> [ 1236.203393] ? _nv028825rm+0x46/0x4a0 [nvidia]
> [ 1236.204041] ? _nv009323rm+0x7b/0x90 [nvidia]
> [ 1236.204667] ? _nv009319rm+0xfb/0x4f0 [nvidia]
> [ 1236.205302] ? _nv037231rm+0xfd/0x180 [nvidia]
> [ 1236.205939] ? _nv034489rm+0x248/0x370 [nvidia]
> [ 1236.206528] ? _nv009448rm+0x3d/0x90 [nvidia]
> [ 1236.207153] ? _nv029075rm+0x14c/0x670 [nvidia]
> [ 1236.207759] ? _nv028910rm+0x520/0x900 [nvidia]
> [ 1236.208378] ? _nv002525rm+0x9/0x20 [nvidia]
> [ 1236.208966] ? _nv003517rm+0x1b/0x80 [nvidia]
> [ 1236.209551] ? _nv013021rm+0x6fe/0x770 [nvidia]
> [ 1236.210149] ? _nv038021rm+0xb3/0x150 [nvidia]
> [ 1236.210736] ? _nv038020rm+0x388/0x4e0 [nvidia]
> [ 1236.211336] ? _nv036312rm+0xbe/0x140 [nvidia]
> [ 1236.211939] ? _nv036313rm+0x42/0x70 [nvidia]
> [ 1236.212525] ? _nv008273rm+0x4b/0x90 [nvidia]
> [ 1236.213117] ? _nv000709rm+0x4ef/0x880 [nvidia]
> [ 1236.213709] ? rm_ioctl+0x54/0xb0 [nvidia]
> [ 1236.214228] ? nvidia_ioctl+0x66c/0x880 [nvidia]
> [ 1236.214816] ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
> [ 1236.215516] ? __x64_sys_ioctl+0x83/0xb0
> [ 1236.215972] ? do_syscall_64+0x33/0x80
> [ 1236.216405] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> On Wed, 29 Dec 2021, at 11:16 PM, Bronek Kozicki wrote:
>> Hi
>>
>> Hoping someone solved this one before.
>>
>> My host is an Epyc Milan on an ASRock ROMED8-2T, running a fresh Arch
>> Linux install; the GPU is an NVIDIA Quadro RTX 4000. The guest is
>> Debian 11 with NVIDIA-460 drivers. I can see the drivers are
>> correctly loaded in the guest (with nvidia-smi), but Xorg fails to
>> initialize. The tail of /var/log/Xorg.0.log is:
>>
>>
>> [ 254.714] (II) NVIDIA: Using 24576.00 MB of virtual memory for
>> indirect memory
>> [ 254.714] (II) NVIDIA: access.
>> [ 257.719] (EE) NVIDIA(GPU-0): Failed to initialize DMA.
>> [ 257.720] (EE) NVIDIA(0): Failed to allocate push buffer
>> [ 257.829] (EE)
>> Fatal server error:
>> [ 257.829] (EE) AddScreen/ScreenInit failed for driver 0
>> [ 257.829] (EE)
>> [ 257.829] (EE)
>> Please consult the The X.Org Foundation support
>> at http://wiki.x.org
>> for help.
>> [ 257.829] (EE) Please also check the log file at
>> "/var/log/Xorg.0.log" for additional information.
>> [ 257.829] (EE)
>> [ 257.829] (EE) Server terminated with error (1). Closing log file.
>>
>> I am running similar configuration (same card, also Debian 11 and
>> nvidia-460 drivers) on a different host, with an older Intel Xeon CPU.
>> No problems there.
>>
>> Any hints?
>>
>>
>> B.
>>
>> --
>> Bronek Kozicki
>> brok at incorrekt.com
>>
>> _______________________________________________
>> vfio-users mailing list
>> vfio-users at redhat.com
>> https://listman.redhat.com/mailman/listinfo/vfio-users
>
> --
> Bronek Kozicki
> brok at incorrekt.com
>
>
--
Bronek Kozicki
brok at incorrekt.com