[vfio-users] NVIDIA error: Failed to initialize DMA. Failed to allocate push buffer

Bronek Kozicki brok at incorrekt.com
Thu Dec 30 10:48:00 UTC 2021


I think this is the strongest hint: the host dmesg is flooded with "BAR 1: can't reserve" messages.


2021-12-30T10:01:59.456992+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: vfio_ecap_init: hiding ecap 0x1e at 0x258
2021-12-30T10:01:59.457413+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: vfio_ecap_init: hiding ecap 0x19 at 0x900
2021-12-30T10:01:59.457675+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: BAR 1: can't reserve [mem 0x20000000000-0x2000fffffff 64bit pref]
2021-12-30T10:01:59.486586+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.1: enabling device (0000 -> 0002)
2021-12-30T10:01:59.546592+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.3: enabling device (0000 -> 0002)

. . .

2021-12-30T10:09:58.164738+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: BAR 1: can't reserve [mem 0x20000000000-0x2000fffffff 64bit pref]
2021-12-30T10:09:58.164811+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: BAR 1: can't reserve [mem 0x20000000000-0x2000fffffff 64bit pref]
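
To put a number on "flooded", a small filter like the one below can tally the failures per device and BAR. This is only a sketch: count_bar_failures is a name made up here, and it assumes the kernel log lines keep the "vfio-pci <address>: BAR <n>: can't reserve" shape shown above.

```shell
# Hypothetical helper (not part of any tool): tally "can't reserve" lines
# per PCI address and BAR number. Reads kernel log text on stdin.
count_bar_failures() {
    awk '
        /vfio-pci [0-9a-f:.]+: BAR [0-9]+: can.t reserve/ {
            for (i = 1; i < NF; i++) {
                if ($i == "vfio-pci") {
                    dev = $(i + 1); sub(/:$/, "", dev)   # "0000:81:00.0:" -> "0000:81:00.0"
                    bar = $(i + 3); sub(/:$/, "", bar)   # "1:" -> "1"
                    count[dev " BAR " bar]++
                    break
                }
            }
        }
        END { for (k in count) printf "%d %s\n", count[k], k }
    ' | sort -k2
}
```

Typical use would be "sudo dmesg | count_bar_failures" or "journalctl -k | count_bar_failures"; each output line is a count followed by the device address and BAR number.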


I will try adjusting the host BIOS options.
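
Before changing firmware settings, it may be worth checking what already occupies that window on the host. "can't reserve" generally means an overlapping range has already been claimed, and /proc/iomem lists the claimants (read it as root, otherwise the addresses show as zeros). The sketch below prints every entry overlapping a given range; iomem_overlaps is a made-up name, and the addresses in the usage note are the ones from the log above.

```shell
# Hypothetical helper: print /proc/iomem entries that overlap START-END.
# START and END are hex without the 0x prefix; FILE defaults to /proc/iomem.
# The body runs in a subshell, so its variables do not leak to the caller.
iomem_overlaps() (
    want_start=$((0x$1))
    want_end=$((0x$2))
    file=${3:-/proc/iomem}
    # Each line looks like "  <start>-<end> : <owner>"; read strips the indent.
    while read -r range sep owner; do
        case $range in *-*) ;; *) continue ;; esac
        start=$((0x${range%-*}))
        end=$((0x${range#*-}))
        if [ "$start" -le "$want_end" ] && [ "$end" -ge "$want_start" ]; then
            printf '%s : %s\n' "$range" "$owner"
        fi
    done < "$file"
)
```

For example, from a root shell: "iomem_overlaps 20000000000 2000fffffff". The enclosing PCI bus window and 0000:81:00.0 itself are expected to show up; any other owner of that range would be the thing to chase.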


B.


On Thu, 30 Dec 2021, at 10:25 AM, Bronek Kozicki wrote:
> Some more information:
>
>
> 1. The driver seems to load fine in the guest:
> bronekk@euclid:~$ sudo dmesg | grep -E "nvidia|0d:00"
> [    0.810066] pci 0000:0d:00.0: [10de:1eb1] type 00 class 0x030000
> [    0.814518] pci 0000:0d:00.0: reg 0x10: [mem 0xc0000000-0xc0ffffff]
> [    0.818518] pci 0000:0d:00.0: reg 0x14: [mem 
> 0x1000000000-0x100fffffff 64bit pref]
> [    0.825110] pci 0000:0d:00.0: reg 0x1c: [mem 
> 0x1010000000-0x1011ffffff 64bit pref]
> [    0.829048] pci 0000:0d:00.0: reg 0x24: [io  0x9000-0x907f]
> [    0.834899] pci 0000:0d:00.0: PME# supported from D0 D3hot D3cold
> [    0.836042] pci 0000:0d:00.1: [10de:10f8] type 00 class 0x040300
> [    0.837841] pci 0000:0d:00.1: reg 0x10: [mem 0xc1000000-0xc1003fff]
> [    0.845020] pci 0000:0d:00.2: [10de:1ad8] type 00 class 0x0c0330
> [    0.847351] pci 0000:0d:00.2: reg 0x10: [mem 
> 0x1012000000-0x101203ffff 64bit pref]
> [    0.854518] pci 0000:0d:00.2: reg 0x1c: [mem 
> 0x1012040000-0x101204ffff 64bit pref]
> [    0.858820] pci 0000:0d:00.2: PME# supported from D0 D3hot D3cold
> [    0.862836] pci 0000:0d:00.3: [10de:1ad9] type 00 class 0x0c8000
> [    0.864838] pci 0000:0d:00.3: reg 0x10: [mem 0xc1004000-0xc1004fff]
> [    0.873964] pci 0000:0d:00.3: PME# supported from D0 D3hot D3cold
> [    0.932598] pci 0000:0d:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
> [    0.934523] pci 0000:0d:00.0: vgaarb: bridge control possible
> [    0.936134] pci 0000:0d:00.0: vgaarb: setting as boot device (VGA legacy resources not available)
> [    1.440190] pci 0000:0d:00.1: D0 power state depends on 0000:0d:00.0
> [    1.441170] pci 0000:0d:00.2: D0 power state depends on 0000:0d:00.0
> [    1.443582] pci 0000:0d:00.3: D0 power state depends on 0000:0d:00.0
> [    2.619525] xhci_hcd 0000:0d:00.2: xHCI Host Controller
> [    2.620624] xhci_hcd 0000:0d:00.2: new USB bus registered, assigned 
> bus number 11
> [    2.622792] xhci_hcd 0000:0d:00.2: hcc params 0x0180ff05 hci version 
> 0x110 quirks 0x0000000000000010
> [    2.672211] usb usb11: SerialNumber: 0000:0d:00.2
> [    2.676422] xhci_hcd 0000:0d:00.2: xHCI Host Controller
> [    2.677944] xhci_hcd 0000:0d:00.2: new USB bus registered, assigned 
> bus number 12
> [    2.681209] xhci_hcd 0000:0d:00.2: Host supports USB 3.1 Enhanced 
> SuperSpeed
> [    2.705956] usb usb12: SerialNumber: 0000:0d:00.2
> [    3.926249] nvidia: loading out-of-tree module taints kernel.
> [    3.927118] nvidia: module license 'NVIDIA' taints kernel.
> [    3.938804] nvidia: module verification failed: signature and/or 
> required key missing - tainting kernel
> [    3.966693] nvidia-nvlink: Nvlink Core is being initialized, major 
> device number 249
> [    3.971181] nvidia 0000:0d:00.0: vgaarb: changed VGA decodes: 
> olddecodes=io+mem,decodes=none:owns=none
> [    4.070078] nvidia-modeset: Loading NVIDIA Kernel Mode Setting 
> Driver for UNIX platforms  460.91.03  Fri Jul  2 05:43:38 UTC 2021
> [    4.349705] [drm] [nvidia-drm] [GPU ID 0x00000d00] Loading driver
> [    4.352647] [drm] Initialized nvidia-drm 0.0.0 20160202 for 
> 0000:0d:00.0 on minor 0
> [    4.527067] audit: type=1400 audit(1640858541.112:5): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="nvidia_modprobe" pid=650 comm="apparmor_parser"
> [    4.527073] audit: type=1400 audit(1640858541.112:6): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="nvidia_modprobe//kmod" pid=650 comm="apparmor_parser"
> [    4.915963] snd_hda_intel 0000:0d:00.1: Disabling MSI
> [    4.954737] snd_hda_intel 0000:0d:00.1: Handle vga_switcheroo audio 
> client
> [    5.244486] input: HDA NVidia HDMI/DP,pcm=3 as 
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input6
> [    5.247732] input: HDA NVidia HDMI/DP,pcm=7 as 
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input7
> [    5.250636] input: HDA NVidia HDMI/DP,pcm=8 as 
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input8
> [    5.253520] input: HDA NVidia HDMI/DP,pcm=9 as 
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input9
> [    5.256445] input: HDA NVidia HDMI/DP,pcm=10 as 
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input10
> [    5.259401] input: HDA NVidia HDMI/DP,pcm=11 as 
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input11
> [    5.262271] input: HDA NVidia HDMI/DP,pcm=12 as 
> /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input12
>
>
> bronekk@euclid:~$ sudo nvidia-smi
> Thu Dec 30 10:04:48 2021
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  Quadro RTX 4000     On   | 00000000:0D:00.0 Off |                  N/A |
> | 30%   39C    P8     3W / 125W |      1MiB /  7982MiB |      0%      Default |
> |                               |                      |                  N/A |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                                  |
> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
> |        ID   ID                                                   Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+
>
> bronekk@euclid:~$ sudo lspci -vnn -s 0d:00.0
> 0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL 
> [Quadro RTX 4000] [10de:1eb1] (rev a1) (prog-if 00 [VGA controller])
>         Subsystem: Dell TU104GL [Quadro RTX 4000] [1028:12a0]
>         Physical Slot: 0-12
>         Flags: bus master, fast devsel, latency 0, IRQ 116
>         Memory at c0000000 (32-bit, non-prefetchable) [size=16M]
>         Memory at 1000000000 (64-bit, prefetchable) [size=256M]
>         Memory at 1010000000 (64-bit, prefetchable) [size=32M]
>         I/O ports at 9000 [size=128]
>         Capabilities: [60] Power Management version 3
>         Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
>         Capabilities: [78] Express Legacy Endpoint, MSI 00
>         Capabilities: [100] Virtual Channel
>         Capabilities: [250] Latency Tolerance Reporting
>         Capabilities: [128] Power Budgeting <?>
>         Capabilities: [420] Advanced Error Reporting
>         Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 
> Len=024 <?>
>         Kernel driver in use: nvidia
>         Kernel modules: nvidia
>
>
> 2. host should not be trying to access the card:
>
> bronekk@gauss ~ % sudo lspci -vnn -s 81:00.0
> 81:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL 
> [Quadro RTX 4000] [10de:1eb1] (rev a1) (prog-if 00 [VGA controller])
>         Subsystem: Dell Device [1028:12a0]
>         Flags: bus master, fast devsel, latency 0, IRQ 381, IOMMU group 
> 31
>         Memory at bc000000 (32-bit, non-prefetchable) [size=16M]
>         Memory at 20000000000 (64-bit, prefetchable) [size=256M]
>         Memory at 20010000000 (64-bit, prefetchable) [size=32M]
>         I/O ports at b000 [size=128]
>         Expansion ROM at bd000000 [disabled] [size=512K]
>         Capabilities: [60] Power Management version 3
>         Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
>         Capabilities: [78] Express Legacy Endpoint, MSI 00
>         Capabilities: [100] Virtual Channel
>         Capabilities: [258] L1 PM Substates
>         Capabilities: [128] Power Budgeting <?>
>         Capabilities: [420] Advanced Error Reporting
>         Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 
> Len=024 <?>
>         Capabilities: [900] Secondary PCI Express
>         Capabilities: [bb0] Physical Resizable BAR
>         Kernel driver in use: vfio-pci
>         Kernel modules: nouveau
>
>
> bronekk@gauss ~ % sudo cat /etc/modprobe.d/40-blacklist.conf
> # This host is headless, prevent any modules from attaching to video hardware
>
> # NVIDIA
> blacklist nouveau
> blacklist nvidia
>
> # AMD
> blacklist radeon
> blacklist amdgpu
> blacklist amdkfd
> blacklist fglrx
>
> # HDMI sound on a GPU
> blacklist snd_hda_intel
>
> # Framebuffers (ALL of them)
> blacklist vesafb
> blacklist aty128fb
> blacklist atyfb
> blacklist radeonfb
> blacklist cirrusfb
> blacklist cyber2000fb
> blacklist cyblafb
> blacklist gx1fb
> blacklist hgafb
> blacklist i810fb
> blacklist intelfb
> blacklist kyrofb
> blacklist lxfb
> blacklist matroxfb_base
> blacklist neofb
> blacklist nvidiafb
> blacklist pm2fb
> blacklist rivafb
> blacklist s1d13xxxfb
> blacklist savagefb
> blacklist sisfb
> blacklist sstfb
> blacklist tdfxfb
> blacklist tridentfb
> blacklist vfb
> blacklist viafb
> blacklist vt8623fb
> blacklist udlfb
>
> bronekk@gauss ~ % sudo cat /etc/modprobe.d/30-vfio.conf
> # 10de:* are NVIDIA
> # 1912:0015 is Renesas Technology Corp. uPD720202 USB 3.0 Host Controller
> options vfio-pci ids=10de:1eb1,10de:10f8,10de:1ad8,10de:1ad9,1912:0015
> options vfio-pci disable_vga=1
>
> bronekk@gauss ~ % sudo lspci -nn | grep -F "10de:"
> 81:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL 
> [Quadro RTX 4000] [10de:1eb1] (rev a1)
> 81:00.1 Audio device [0403]: NVIDIA Corporation TU104 HD Audio 
> Controller [10de:10f8] (rev a1)
> 81:00.2 USB controller [0c03]: NVIDIA Corporation TU104 USB 3.1 Host 
> Controller [10de:1ad8] (rev a1)
> 81:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB 
> Type-C UCSI Controller [10de:1ad9] (rev a1)
>
> 3. device mapping in libvirt:
>
>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0x81' slot='0x00' function='0x0'/>
>       </source>
>       <rom bar='off'/>
>       <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' 
> function='0x0' multifunction='on'/>
>     </hostdev>
>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0x81' slot='0x00' function='0x1'/>
>       </source>
>       <rom bar='off'/>
>       <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' 
> function='0x1'/>
>     </hostdev>
>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0x81' slot='0x00' function='0x2'/>
>       </source>
>       <rom bar='off'/>
>       <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' 
> function='0x2'/>
>     </hostdev>
>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <driver name='vfio'/>
>       <source>
>         <address domain='0x0000' bus='0x81' slot='0x00' function='0x3'/>
>       </source>
>       <rom bar='off'/>
>       <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' 
> function='0x3'/>
>     </hostdev>
>
>
> 4. something is definitely wrong inside the guest, since I am getting these:
>
> [ 1236.179163] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! 
> [Xorg:2982]
> [ 1236.179961] Modules linked in: hid_generic usbhid hid rfkill 
> snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg soundwire_intel 
> soundwire_generic_allocation snd_soc_core ghash_clmulni_intel 
> snd_compress soundwire_cadence nls_ascii snd_hda_codec nls_cp437 vfat 
> fat aesni_intel snd_hda_core libaes snd_hwdep crypto_simd soundwire_bus 
> cryptd nvidia_drm(POE) snd_pcm glue_helper snd_timer drm_kms_helper snd 
> iTCO_wdt intel_pmc_bxt joydev iTCO_vendor_support sg serio_raw cec 
> watchdog soundcore virtio_console virtio_balloon pcspkr evdev 
> efi_pstore qemu_fw_cfg nvidia_modeset(POE) nvidia(POE) drm fuse 
> configfs efivarfs virtio_rng rng_core ip_tables x_tables autofs4 ext4 
> crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi sr_mod crc_t10dif cdrom 
> crct10dif_generic ahci libahci xhci_pci libata xhci_hcd virtio_scsi 
> virtio_net net_failover failover scsi_mod usbcore crct10dif_pclmul 
> psmouse crct10dif_common crc32_pclmul crc32c_intel i2c_i801 virtio_pci 
> lpc_ich i2c_smbus virtio_ring usb_common virtio button
> [ 1236.189681] CPU: 12 PID: 2982 Comm: Xorg Tainted: P           OEL    
> 5.10.0-10-amd64 #1 Debian 5.10.84-1
> [ 1236.190725] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 0.0.0 02/06/2015
> [ 1236.191711] RIP: 0010:_nv032887rm+0x12/0x40 [nvidia]
> [ 1236.192286] Code: d2 0e 31 c0 e8 af 7d 78 ff e8 ca 3c eb ff 31 c0 48 
> 83 c4 08 c3 0f 1f 00 48 83 ec 08 39 4a 10 76 17 48 8b 02 c1 e9 02 8b 04 
> 88 <48> 83 c4 08 c3 66 0f 1f 84 00 00 00 00 00 be 00 00 d5 09 bf 0a ad
> [ 1236.194379] RSP: 0018:ffffa9b840f6ba98 EFLAGS: 00000256
> [ 1236.194977] RAX: 00000000164000a1 RBX: 0000000000000020 RCX: 
> 0000000000000000
> [ 1236.195804] RDX: ffff9995889fd0a0 RSI: ffff9995889fc008 RDI: 
> ffff99958b67d008
> [ 1236.196617] RBP: ffff999586b02a00 R08: 0000000000000020 R09: 
> 0000000000000000
> [ 1236.197425] R10: ffff9995889fc008 R11: ffff9995889fd0a0 R12: 
> 0000000000000000
> [ 1236.198217] R13: 0000000000000000 R14: 0000000000000000 R15: 
> ffff9995889fc008
> [ 1236.199011] FS:  00007f7bcbbd6a40(0000) GS:ffff999cdfb00000(0000) 
> knlGS:0000000000000000
> [ 1236.199931] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1236.200587] CR2: 0000564dd482a3a8 CR3: 0000000102d80005 CR4: 
> 0000000000770ee0
> [ 1236.201392] PKRU: 55555554
> [ 1236.201699] Call Trace:
> [ 1236.202118]  ? _nv009235rm+0x1f1/0x230 [nvidia]
> [ 1236.202763]  ? _nv036126rm+0x62/0x70 [nvidia]
> [ 1236.203393]  ? _nv028825rm+0x46/0x4a0 [nvidia]
> [ 1236.204041]  ? _nv009323rm+0x7b/0x90 [nvidia]
> [ 1236.204667]  ? _nv009319rm+0xfb/0x4f0 [nvidia]
> [ 1236.205302]  ? _nv037231rm+0xfd/0x180 [nvidia]
> [ 1236.205939]  ? _nv034489rm+0x248/0x370 [nvidia]
> [ 1236.206528]  ? _nv009448rm+0x3d/0x90 [nvidia]
> [ 1236.207153]  ? _nv029075rm+0x14c/0x670 [nvidia]
> [ 1236.207759]  ? _nv028910rm+0x520/0x900 [nvidia]
> [ 1236.208378]  ? _nv002525rm+0x9/0x20 [nvidia]
> [ 1236.208966]  ? _nv003517rm+0x1b/0x80 [nvidia]
> [ 1236.209551]  ? _nv013021rm+0x6fe/0x770 [nvidia]
> [ 1236.210149]  ? _nv038021rm+0xb3/0x150 [nvidia]
> [ 1236.210736]  ? _nv038020rm+0x388/0x4e0 [nvidia]
> [ 1236.211336]  ? _nv036312rm+0xbe/0x140 [nvidia]
> [ 1236.211939]  ? _nv036313rm+0x42/0x70 [nvidia]
> [ 1236.212525]  ? _nv008273rm+0x4b/0x90 [nvidia]
> [ 1236.213117]  ? _nv000709rm+0x4ef/0x880 [nvidia]
> [ 1236.213709]  ? rm_ioctl+0x54/0xb0 [nvidia]
> [ 1236.214228]  ? nvidia_ioctl+0x66c/0x880 [nvidia]
> [ 1236.214816]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
> [ 1236.215516]  ? __x64_sys_ioctl+0x83/0xb0
> [ 1236.215972]  ? do_syscall_64+0x33/0x80
> [ 1236.216405]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> On Wed, 29 Dec 2021, at 11:16 PM, Bronek Kozicki wrote:
>> Hi
>>
>> Hoping someone solved this one before.
>>
>> My host is an Epyc Milan on an ASRock ROMED8-2T with a fresh Arch
>> Linux install; the GPU is an NVIDIA Quadro RTX 4000. The guest is
>> Debian 11 with NVIDIA-460 drivers. I can see the drivers are
>> correctly loaded in the guest (with nvidia-smi), but Xorg fails to
>> initialize. The tail of /var/log/Xorg.0.log is:
>>
>>
>> [   254.714] (II) NVIDIA: Using 24576.00 MB of virtual memory for 
>> indirect memory
>> [   254.714] (II) NVIDIA:     access.
>> [   257.719] (EE) NVIDIA(GPU-0): Failed to initialize DMA.
>> [   257.720] (EE) NVIDIA(0): Failed to allocate push buffer
>> [   257.829] (EE) 
>> Fatal server error:
>> [   257.829] (EE) AddScreen/ScreenInit failed for driver 0
>> [   257.829] (EE) 
>> [   257.829] (EE) 
>> Please consult the The X.Org Foundation support 
>> 	 at http://wiki.x.org
>>  for help. 
>> [   257.829] (EE) Please also check the log file at 
>> "/var/log/Xorg.0.log" for additional information.
>> [   257.829] (EE) 
>> [   257.829] (EE) Server terminated with error (1). Closing log file.
>>
>> I am running a similar configuration (same card, also Debian 11 and
>> nvidia-460 drivers) on a different host, with an older Intel Xeon CPU.
>> No problems there.
>>
>> Any hints?
>>
>>
>> B.
>>
>> -- 
>>   Bronek Kozicki
>>   brok at incorrekt.com
>>
>> _______________________________________________
>> vfio-users mailing list
>> vfio-users at redhat.com
>> https://listman.redhat.com/mailman/listinfo/vfio-users
>
> -- 
>   Bronek Kozicki
>   brok at incorrekt.com
>
>

-- 
  Bronek Kozicki
  brok at incorrekt.com



