[RFC 00/18] vfio: Adopt iommufd

Tue Apr 26 19:36:17 UTC 2022

On Tue, Apr 26, 2022 at 01:24:35PM -0600, Alex Williamson wrote:
> On Tue, 26 Apr 2022 13:42:17 -0300
> Jason Gunthorpe <jgg at nvidia.com> wrote:
> 
> > On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> > > We also need to be able to advise libvirt as to how each iommufd object
> > > or user of that object factors into the VM locked memory requirement.
> > > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> > > to set the locked memory limit to the size of VM RAM per iommufd,
> > > regardless of the number of devices using a given iommufd.  However, I
> > > don't know if all users of iommufd will be exclusively mapping VM RAM.
> > > Combinations of devices where some map VM RAM and others map QEMU
> > > buffer space could still require some incremental increase per device
> > > (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> > > will still be involved even after iommufd solves the per-device
> > > vfio-pci locked memory limit issue.  Thanks,  
> > 
> > If the model is to pass the FD, how about we put a limit on the FD
> > itself instead of abusing the locked memory limit?
> > 
> > We could have a no-way-out ioctl that directly limits the # of PFNs
> > covered by iopt_pages inside an iommufd.
> 
> FD passing would likely only be the standard for libvirt invoked VMs.
> The QEMU vfio-pci device would still parse a host= or sysfsdev= option
> when invoked by mortals and associate to use the legacy vfio group
> interface or the new vfio device interface based on whether an iommufd
> is specified.

Yes, but perhaps we don't need resource limits in the mortals case.

> Does that rule out your suggestion?  I don't know, please reveal more
> about the mechanics of putting a limit on the FD itself and this
> no-way-out ioctl.  The latter name suggests to me that I should also
> note that we need to support memory hotplug with these devices.  Thanks,

So libvirt uses CAP_SYS_RESOURCE and prlimit to adjust things in
realtime today?

It could still work: instead of being no-way-out, iommufd would have to
check for CAP_SYS_RESOURCE before making the limit higher.

It is a pretty simple idea: we just attach a resource limit to the FD,
and every PFN that gets mapped into the iommufd counts against that
limit, regardless of whether it is pinned or not. An ioctl on the FD
would set the limit, defaulting to unlimited.
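
As a rough userspace model of that accounting (only a sketch, nothing
here exists in iommufd): struct fd_pfn_limit and the fd_pfn_*() helpers
are hypothetical names, the bool stands in for a capable(CAP_SYS_RESOURCE)
check, and refusing to lower the limit below current usage is an
assumption of the sketch, not something decided above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define FD_PFN_UNLIMITED UINT64_MAX

/* Hypothetical per-FD accounting state; not a real iommufd structure. */
struct fd_pfn_limit {
	uint64_t limit_pfns;	/* ceiling on mapped PFNs */
	uint64_t mapped_pfns;	/* PFNs currently mapped, pinned or not */
};

/* A new FD starts with no limit and nothing mapped. */
void fd_pfn_limit_init(struct fd_pfn_limit *l)
{
	l->limit_pfns = FD_PFN_UNLIMITED;
	l->mapped_pfns = 0;
}

/*
 * Models the limit-setting ioctl. Lowering is always allowed; raising
 * needs privilege (has_cap stands in for CAP_SYS_RESOURCE).
 * Assumption: a limit below the current usage is rejected with -EBUSY.
 */
int fd_pfn_limit_set(struct fd_pfn_limit *l, uint64_t new_limit, bool has_cap)
{
	if (new_limit > l->limit_pfns && !has_cap)
		return -EPERM;
	if (new_limit < l->mapped_pfns)
		return -EBUSY;
	l->limit_pfns = new_limit;
	return 0;
}

/* Charge npfns on map; fails without side effects when over the limit. */
int fd_pfn_charge(struct fd_pfn_limit *l, uint64_t npfns)
{
	/* Overflow-safe form of: mapped_pfns + npfns > limit_pfns */
	if (npfns > l->limit_pfns - l->mapped_pfns)
		return -ENOMEM;
	l->mapped_pfns += npfns;
	return 0;
}

/* Return npfns to the budget on unmap. */
void fd_pfn_uncharge(struct fd_pfn_limit *l, uint64_t npfns)
{
	assert(npfns <= l->mapped_pfns);
	l->mapped_pfns -= npfns;
}
```

In this model the memory hotplug case is a privileged fd_pfn_limit_set()
to a larger value followed by further fd_pfn_charge() calls.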

To me this has the appeal that what is being resource controlled is
strictly defined - address space mapped into an iommufd - which has a
bunch of nice additional consequences like partially bounding the
amount of kernel memory an iommufd can consume and so forth.

It doesn't interact with io_uring or RDMA, however.

Though we could certainly consider allowing RDMA to consume an iommufd
to access pinned pages much like a vfio-mdev would - I'm not sure what
is ideal for the QEMU usage of RDMA for migration.

Jason