[PATCH RFC v2 00/13] IOMMUFD Generic interface
Jason Gunthorpe
jgg at nvidia.com
Wed Oct 12 12:32:50 UTC 2022
On Tue, Oct 11, 2022 at 04:30:58PM -0400, Steven Sistare wrote:
> On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
> > On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
> >>> Do we have a solution to this?
> >>>
> >>> If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
> >>>
> >>> Aside from the approach to use the FD, another idea is to just use
> >>> fork.
> >>>
> >>> qemu would do something like
> >>>
> >>> .. stop all container ioctl activity ..
> >>> fork()
> >>> ioctl(CHANGE_MM) // switch all maps to this mm
> >>> .. signal parent..
> >>> .. wait parent..
> >>> exit(0)
> >>> .. wait child ..
> >>> exec()
> >>> ioctl(CHANGE_MM) // switch all maps to this mm
> >>> ..signal child..
> >>> waitpid(childpid)
> >>>
> >>> This way the kernel is never left without a page provider for the
> >>> maps, the dummy mm_struct belonging to the fork will serve that role
> >>> for the gap.
> >>>
> >>> And the above is only required if we have mdevs, so we could imagine
> >>> userspace optimizing it away for, eg vfio-pci only cases.
> >>>
> >>> It is not as efficient as using a FD backing, but this is super easy
> >>> to implement in the kernel.
> >>
> >> I propose to avoid deadlock for mediated devices as follows. Currently, an
> >> mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
> >> is asserted.
> >>
> >> * In vfio_wait, I will maintain a list of waiters, each list element
> >> consisting of (task, mdev, close_flag=false).
> >>
> >> * When the vfio device descriptor is closed, vfio_device_fops_release
> >> will notify the vfio_iommu driver, which will find the mdev on the waiters
> >> list, set elem->close_flag=true, and call wake_up_process for the task.
> >
> > This alone is not sufficient, the mdev driver can continue to
> > establish new mappings until its close_device function
> > returns. Killing only existing mappings is racy.
> >
> > I think you are focusing on the one issue I pointed at, as I said, I'm
> > sure there are more ways than just close to abuse this functionality
> > to deadlock the kernel.
> >
> > I continue to prefer we remove it completely and do something more
> > robust. I suggested two options.
>
> It's not racy. New pin requests also land in vfio_wait if any vaddrs have
> been invalidated in any vfio_dma in the iommu. See:
>
> vfio_iommu_type1_pin_pages()
>   if (iommu->vaddr_invalid_count)
>     vfio_find_dma_valid()
>       vfio_wait()
I mean you can't do a one-shot wakeup of only existing waiters, and
you can't corrupt the container to wake up waiters for other devices,
so I don't see how this can be made to work safely...
It also doesn't solve any flow that doesn't trigger a file close, like a
process thread being stuck on the wait in the kernel, e.g. because a
trapped MMIO triggered an access or something.
So it doesn't seem like a workable direction to me.
> However, I will investigate saving a reference to the file object in
> the vfio_dma (for mappings backed by a file) and using that to
> translate IOVA's.
It is certainly the best flow, but it may be difficult. Eg the memfd
work for KVM to do something similar is quite involved.
> I think that will be easier to use than fork/CHANGE_MM/exec, and may
> even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR. To be
> continued.
Yes, certainly easier to use. I suggested CHANGE_MM because the kernel
implementation is very easy; I could probably send you something to test
with iommufd in a few hours' effort.
Anyhow, I think this conversation has convinced me there is no way to
fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it, since it
is basically a security bug.
Jason