[vfio-users] cudaErrorDevicesUnavailable using Cuda in KVM using VFIO device passthrough

Alex Williamson alex.williamson at redhat.com
Tue May 1 17:38:00 UTC 2018


On Tue, 1 May 2018 00:37:36 +0000
Andrew Zimmerman <atz at rincon.com> wrote:

> I have a system with 4 Tesla V100-SXM2-16GB GPUs in it, and I am

SXM is the NVLink variant, right?  VFIO has no special knowledge or
handling of NVLink, and I'd only consider it supported insofar as it
behaves like PCIe.  A particular concern with NVLink is whether the
mesh nature of the interconnect makes use of and enforces the
IOMMU-based translations necessary for device isolation and
assignment, but we can't know this because the interconnect is
proprietary.  Does NVIDIA claim that VFIO device assignment is
supported for these GPUs?
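
One host-side sanity check that doesn't depend on anything
NVIDIA-specific is to look at which devices share the GPU's IOMMU
group; everything in the group has to be bound to vfio-pci (or a stub
driver) before the group can be assigned.  A minimal sketch in plain
C++17 that walks sysfs -- the PCI address 0000:3b:00.0 below is a
placeholder you'd replace with one of your V100s:

// Hypothetical diagnostic: print the IOMMU group of a PCI device and
// every other device that shares it.  Build with: g++ -std=c++17
#include <filesystem>
#include <iostream>

int main() {
    namespace fs = std::filesystem;
    const fs::path dev = "/sys/bus/pci/devices/0000:3b:00.0"; // placeholder
    std::error_code ec;
    // iommu_group is a symlink to /sys/kernel/iommu_groups/<N>
    fs::path group = fs::read_symlink(dev / "iommu_group", ec);
    if (ec) {
        std::cerr << "no iommu_group link (IOMMU disabled?)\n";
        return 1;
    }
    std::cout << "IOMMU group " << group.filename().string() << ":\n";
    // All of these must be bound to vfio-pci (or a stub) for assignment.
    for (const auto& entry :
         fs::directory_iterator(dev / "iommu_group" / "devices"))
        std::cout << "  " << entry.path().filename().string() << "\n";
    return 0;
}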

> attempting to pass these devices through to virtual machines run by
> KVM. I am managing the VMs with OpenNebula and I have followed the
> instructions at
> https://docs.opennebula.org/5.4/deployment/open_cloud_host_setup/pci_passthrough.html
> to pass the device through to my VM. I am able to see the device in
> nvidia-smi, watch its power/temperature levels, change the
> persistence mode and compute mode, etc.

Ugh, official documentation that recommends the vfio-bind script and
manually modifying libvirt's device ACL.  I'd be suspicious of any
device assignment support that relies on those sorts of instructions.
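
If you do use that script, at least verify the result rather than
trusting it: the sysfs driver symlink shows what a device is actually
bound to.  A hypothetical spot check in the same vein (placeholder
address again):

// Hypothetical check: print the driver a PCI device is bound to, so
// you can confirm it really ended up on vfio-pci.  g++ -std=c++17
#include <filesystem>
#include <iostream>

int main() {
    namespace fs = std::filesystem;
    std::error_code ec;
    fs::path drv = fs::read_symlink(
        "/sys/bus/pci/devices/0000:3b:00.0/driver", ec); // placeholder
    if (ec)
        std::cout << "no driver bound\n";   // unclaimed device
    else
        std::cout << "bound to: " << drv.filename().string() << "\n";
    return 0;
}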
 
> I can query the device to get properties and capabilities, but when I
> try to run a program on it that utilizes the device (beyond
> querying), I receive an error message about the device being
> unavailable. To test, I am using simpleAtomicIntrinsics out of the
> CUDA Samples. Here is the output I receive:
> 
> simpleAtomicIntrinsics starting...
> 
> GPU Device 0: "Tesla V100-SXM2-16GB": with compute capability 7.0
> 
> GPU device has 80 Multi-Processors, SM 7.0 compute capabilities
> 
> Cuda error at simpleAtomicIntrinsics.cu:108
> code=46(cudaErrorDevicesUnavailable) "cudaMalloc((void **) &dOData,
> memsize)"
> 
> I have tried this with multiple devices (in case there was an issue
> with vfio on the first device) and had the same result on each of
> them.

Have you tried with a PCIe Tesla?
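
If you want to take the sample code out of the equation too, a minimal
repro along these lines (call it cumalloc.cu, built with nvcc) does
nothing but query device 0 and attempt the same kind of cudaMalloc
that fails at simpleAtomicIntrinsics.cu:108.  If it fails on the SXM2
parts but passes on a PCIe card, that points at the interconnect:

// Minimal sketch: isolate the first CUDA runtime call that touches
// the device.  Build with: nvcc -o cumalloc cumalloc.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("device 0: %s (SM %d.%d)\n",
                prop.name, prop.major, prop.minor);

    void *dptr = nullptr;
    err = cudaMalloc(&dptr, 1 << 20);  // 1 MiB, same call as the sample
    std::printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess)
        cudaFree(dptr);
    return err == cudaSuccess ? 0 : 1;
}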

> The host OS is CentOS 7.4.1708. I upgraded the kernel to 4.15.15-1
> from the elrepo to ensure that I had support for vfio_virqfd. I am
> running the NVIDIA 390.15 driver and using cuda 9.1
> (cuda-9-1-9.1.85-1.x86_64 rpm).

vfio_virqfd is just an artifact of OpenNebula's apparently terrible
handling of device assignment.  virqfd is there in a RHEL/CentOS 7.4
kernel, but it may not be a separate module, and it's not necessary to
load it via dracut as their guide indicates; the only step there that
matters is blacklisting nouveau.
 
> Does anyone have ideas on what could be causing this or what I could
> try next?

I think you're in uncharted territory with NVLink-based GPUs and a
not-quite-standard device assignment setup in your chosen distro.  I'd
start by testing whether the program works with PCIe GPUs to rule out
the interconnect.  Thanks,

Alex
