[vfio-users] cudaErrorDevicesUnavailable using CUDA in KVM with VFIO device passthrough

Alex Williamson alex.williamson at redhat.com
Tue May 1 20:30:02 UTC 2018


Hi Andrew,

On Tue, 1 May 2018 19:30:58 +0000
Andrew Zimmerman <atz at rincon.com> wrote:

> Alex,
> 
> Thank you for your reply and all of your ideas.  You are right that
> the SXM uses NVLink - I had not thought of that as a potential
> culprit.  I do not have any PCIe GPUs in this cluster, but I may be
> able to set up a standalone test on an older box.

I tried it on a RHEL7.5 host, RHEL7.5 guest, assigned (PCIe) Tesla P4,
driver 390.46, cuda-samples-9-1:

# ./simpleAtomicIntrinsics 
simpleAtomicIntrinsics starting...
GPU Device 0: "Tesla P4" with compute capability 6.1

> GPU device has 20 Multi-Processors, SM 6.1 compute capabilities

Processing time: 114.939003 (ms)
simpleAtomicIntrinsics completed, returned OK

> I have not seen a specific mention from NVIDIA regarding VFIO support
> for this form factor of the Tesla V100, but there were talks at GTC
> regarding using Tesla cards with VFIO.

Yes, we (RH & NVIDIA) support assignment of Tesla, GRID, and
sufficiently expensive Quadro cards with vfio, and the vGPU framework
for KVM is built on vfio, but all of this is only for PCIe based
devices AFAIK.

> Do you know of a better guide you could point me to for getting up
> and running with VFIO?  I was thinking that it felt like a
> permissions issue (as I can query the device, but not write to it),
> so it could be an issue with how it had me set up the ACLs...

Those ACLs were only for the host.  You can't do device assignment
without the guest having full access to the device, so if you can
assign the device at all, those ACLs are not the problem.  If you
started with just RHEL/CentOS 7.4 installed as a hypervisor and you
have somewhere you can run virt-manager (i.e. a Linux desktop), the
key steps for a compute GPU are:

 - Enable the IOMMU on the host: intel_iommu=on on the host kernel
   command line, assuming an x86_64 system (config sketch below).
 - Blacklist nouveau on the host, just as if you were going to
   install the nvidia driver on the host.
 - Create and install a VM with virt-manager.
 - Blacklist nouveau in the VM as well, because you are going to
   install the nvidia driver there.
 - Use virt-manager to add the Tesla to the VM, then install the
   driver and CUDA dev kit in the guest.
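
A minimal sketch of those host-side config files, assuming a stock
RHEL/CentOS 7 host booting via GRUB2 on BIOS (the grub.cfg path
differs on EFI systems, and the existing GRUB_CMDLINE_LINUX contents
are elided here):

# /etc/default/grub -- append intel_iommu=on to the existing options
GRUB_CMDLINE_LINUX="... intel_iommu=on"

# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# regenerate the grub config and the initramfs, then reboot
grub2-mkconfig -o /boot/grub2/grub.cfg
dracut --force

The same blacklist-nouveau.conf works inside the guest before you
install the nvidia driver there.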

There are command-line tools to do all of this too (virt-install,
virt-viewer, virsh), but virt-manager just makes it easier.
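
For example, a rough sketch of attaching the device with virsh; the
PCI address 0000:3b:00.0 and the VM name are placeholders, substitute
the Tesla's address from lspci -nn on the host:

# hostdev.xml -- managed='yes' lets libvirt rebind the device to
# vfio-pci for you
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
  </source>
</hostdev>

# attach it to a VM named "rhel75-guest" (hypothetical), persisting
# across guest restarts
virsh attach-device rhel75-guest hostdev.xml --config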

I'm interested in your experience, but I'll be rather surprised if an
NVLink setup "just works", and perhaps a bit dubious about whether it
should just work given the likely lack of isolation in such a mesh
environment.
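
If you do test it, one thing worth checking up front is how the host
groups those devices, since devices sharing an IOMMU group can only
be assigned together.  A quick way to list the groups (standard
sysfs layout on any recent kernel):

# each group is the unit of isolation for assignment
for g in /sys/kernel/iommu_groups/*; do
    echo "group ${g##*/}: $(ls $g/devices)"
done

Thanks,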

Alex



