Passthrough PCI GPU device fails on reboot

Erik Skultety eskultet at redhat.com
Tue Jul 27 10:29:36 UTC 2021


On Tue, Jul 27, 2021 at 11:22:25AM +0200, Francesc Guasch wrote:
> Hello.
> 
> I have a host with an NVIDIA RTX 3090. I configured PCI passthrough
> and it works fine. We are using it for CUDA and MATLAB on Ubuntu 20.04.
> 
> The problem comes sometimes when rebooting the virtual machine. It doesn't
> happen 100% of the time, but eventually, after 3 or 4 reboots, the PCI
> device stops working. The only solution is to reboot the host.
> 
> The weird thing is that this only happens when rebooting the VM. After a
> host reboot, if we shut down the virtual machine and start it again, it
> works fine. I wrote a small script that does that a hundred times just to
> make sure; only a reboot triggers the problem.
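> 
> It was roughly this (the domain name is just an example):
> 
>     #!/bin/bash
>     # Cycle the guest through a full shutdown + cold start 100 times.
>     # This never reproduces the failure; only a reboot does.
>     for i in $(seq 1 100); do
>         virsh shutdown ubuntu20
>         # wait until the domain has actually powered off
>         while ! virsh domstate ubuntu20 | grep -q 'shut off'; do
>             sleep 5
>         done
>         virsh start ubuntu20
>         sleep 60
>     done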
> 
> When it fails, I run "nvidia-smi" in the virtual machine and I get:
> 
>     No devices were found
> 
> I also spotted some errors in syslog:
> 
>    NVRM: installed in this system is not supported by the
>    NVIDIA 460.91.03 driver release.
>    NVRM: GPU 0000:01:01.0: GPU has fallen off the bus
>    NVRM: the NVIDIA kernel module is unloaded.
>    NVRM: GPU 0000:01:01.0: RmInitAdapter failed! (0x23:0x65:1204)
>    NVRM: GPU 0000:01:01.0: rm_init_adapter failed, device minor number 0
> 
> The device is still there; when I type lspci I can see its information:
> 
>     0000:01:01.0 VGA compatible controller [0300]: NVIDIA Corporation
>     Device [10de:2204] (rev a1)
>         Subsystem: Gigabyte Technology Co., Ltd Device [1458:403b]
>         Kernel driver in use: nvidia
>         Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
> 
> I tried different NVIDIA drivers and Linux kernels in both the host and
> the virtual machine, with the same results.

Hi,
this question is better suited for vfio-users at redhat.com. Once the GPU is bound
to the vfio-pci driver, it's out of libvirt's hands.
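
For the record, you can check from the host which driver holds the card while
the VM runs; something like this (0000:65:00.0 is a made-up host-side address,
substitute the one from your passthrough config):

    # "Kernel driver in use: vfio-pci" confirms the device was handed
    # over to VFIO and the host nvidia driver is not touching it
    lspci -nnk -s 0000:65:00.0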
AFAIR NVIDIA only enabled PCI device assignment for GeForce cards on Windows 10
VMs, but you appear to be running a Linux VM. Back when I worked on the vGPU
stuff, which is supported only on Tesla cards, I remember being told that the
host and guest drivers communicate with each other. Applying the same to
GeForce, I would not be surprised if the NVIDIA host driver detected that the
corresponding guest driver is not a Windows 10 one and didn't do a proper GPU
reset in between VM reboots - hence the need to reboot the host. Not so long
ago there was a similar bus reset bug in the AMD host driver which affected
every single VM shutdown/reboot in such a way that the host had to be rebooted
for the card to be usable again. Be that as it may, I can only speculate, and
since your scenario is officially not supported by NVIDIA, I wish you the best
of luck :)
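
If you want to experiment, you could at least check whether the kernel knows
how to reset the function at all and force a reset by hand while the VM is
shut off. A sketch (0000:65:00.0 again stands in for your host-side address;
this is a debugging idea, not a fix):

    # does the device advertise Function Level Reset?
    lspci -vv -s 0000:65:00.0 | grep -i 'FLReset'

    # this sysfs node only exists if the kernel has a reset method
    ls /sys/bus/pci/devices/0000:65:00.0/reset

    # trigger the reset (as root) and watch dmesg for NVRM/vfio errors
    echo 1 > /sys/bus/pci/devices/0000:65:00.0/reset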

Regards,
Erik



