[PATCH RFC v2 00/13] IOMMUFD Generic interface

Steven Sistare steven.sistare at oracle.com
Wed Oct 12 13:50:53 UTC 2022

On 10/12/2022 8:32 AM, Jason Gunthorpe wrote:
> On Tue, Oct 11, 2022 at 04:30:58PM -0400, Steven Sistare wrote:
>> On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
>>> On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
>>>>> Do we have a solution to this?
>>>>> If not, I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR.
>>>>> Aside from the approach of using the FD, another idea is to just use
>>>>> fork.
>>>>> qemu would do something like
>>>>>  .. stop all container ioctl activity ..
>>>>>  fork()
>>>>>     ioctl(CHANGE_MM) // switch all maps to this mm
>>>>>     .. signal parent.. 
>>>>>     .. wait parent..
>>>>>     exit(0)
>>>>>  .. wait child ..
>>>>>  exec()
>>>>>  ioctl(CHANGE_MM) // switch all maps to this mm
>>>>>  ..signal child..
>>>>>  waitpid(childpid)
>>>>> This way the kernel is never left without a page provider for the
>>>>> maps, the dummy mm_struct belonging to the fork will serve that role
>>>>> for the gap.
>>>>> And the above is only required if we have mdevs, so we could imagine
>>>>> userspace optimizing it away for, e.g., vfio-pci-only cases.
>>>>> It is not as efficient as using a FD backing, but this is super easy
>>>>> to implement in the kernel.
>>>> I propose to avoid deadlock for mediated devices as follows.  Currently, an
>>>> mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
>>>> is asserted.
>>>>   * In vfio_wait, I will maintain a list of waiters, each list element
>>>>     consisting of (task, mdev, close_flag=false).
>>>>   * When the vfio device descriptor is closed, vfio_device_fops_release
>>>>     will notify the vfio_iommu driver, which will find the mdev on the waiters
>>>>     list, set elem->close_flag=true, and call wake_up_process for the task.
>>> This alone is not sufficient, the mdev driver can continue to
>>> establish new mappings until its close_device function
>>> returns. Killing only existing mappings is racy.
>>> I think you are focusing on the one issue I pointed at; as I said, I'm
>>> sure there are more ways than just close to abuse this functionality
>>> to deadlock the kernel.
>>> I continue to prefer we remove it completely and do something more
>>> robust. I suggested two options.
>> It's not racy.  New pin requests also land in vfio_wait if any vaddrs have
>> been invalidated in any vfio_dma in the iommu.  See
>>   vfio_iommu_type1_pin_pages()
>>     if (iommu->vaddr_invalid_count)
>>       vfio_find_dma_valid()
>>         vfio_wait()
> I mean you can't do a one-shot wakeup of only existing waiters, and
> you can't corrupt the container to wake up waiters for other devices,
> so I don't see how this can be made to work safely...
> It also doesn't solve any flow that doesn't trigger file close, like a
> process thread being stuck on the wait in the kernel. eg because a
> trapped mmio triggered an access or something.
> So it doesn't seem like a workable direction to me.
>> However, I will investigate saving a reference to the file object in
>> the vfio_dma (for mappings backed by a file) and using that to
>> translate IOVAs.
> It is certainly the best flow, but it may be difficult. Eg the memfd
> work for KVM to do something similar is quite involved.
>> I think that will be easier to use than fork/CHANGE_MM/exec, and may
>> even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR.  To be
>> continued.
> Yes, certainly easier to use. I suggested CHANGE_MM because the kernel
> implementation is very easy; I could probably send you something to test
> with iommufd in a few hours of effort.
> Anyhow, I think this conversation has convinced me there is no way to
> fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to
> it being a security bug, basically.

Please do not. Please give me the courtesy of time to develop a replacement
before we delete it. Surely you can make progress on other open areas of
iommufd without needing to delete this immediately.

- Steve
