[libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

Jason Wang jasowang at redhat.com
Fri Dec 6 09:40:02 UTC 2019


On 2019/12/6 4:22 PM, Yan Zhao wrote:
> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
>>>> Hi:
>>>>
>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
>>>>> For SR-IOV devices, VFs are passed through to the guest directly without
>>>>> host driver mediation. However, when VMs with passed-through VFs are
>>>>> migrated, dynamic host mediation is required to (1) get device states and
>>>>> (2) get dirty pages. Since device states, as well as other critical
>>>>> information required for dirty page tracking for VFs, are usually
>>>>> retrieved from PFs, it is handy to provide an extension in the PF driver
>>>>> to centrally control VFs' migration.
>>>>>
>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
>>>>> dynamically trap VFs' bars for dirty page tracking and
>>>> A silly question, what's the reason for doing this, is this a must for dirty
>>>> page tracking?
>>>>
>>> For performance reasons. VFs' BARs should be passed through in normal
>>> operation and only enter the trap state when needed.
>>
>> Right, but how does this matter for the case of dirty page tracking?
>>
> Take a NIC as an example: to track its VF's dirty pages, a software
> approach has to trap every write to the ring tail, which resides in BAR0.


Interesting, but it looks like we need:
- decode the instruction
- mediate all access to BAR0
All of which seem a great burden for the VF driver. I wonder whether
doing interrupt relay and tracking the head would be better in this case.
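
The ring-tail trapping idea discussed above can be illustrated with a small
userspace sketch. It is not code from the patch series: the descriptor
layout, ring size, and all names (rx_desc, ring_tracker, on_tail_write) are
hypothetical, and it assumes that every buffer handed to the device by a
tail advance may be written by DMA and is therefore marked dirty
conservatively.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12
#define RING_SIZE  8      /* hypothetical number of descriptors */
#define MAX_PAGES  1024   /* hypothetical guest size in pages */

/* Hypothetical RX descriptor: guest-physical address and length of a buffer. */
struct rx_desc {
    uint64_t gpa;
    uint32_t len;
};

struct ring_tracker {
    struct rx_desc ring[RING_SIZE];
    uint32_t tail;                  /* last tail value seen by the trap */
    uint8_t dirty[MAX_PAGES / 8];   /* one bit per guest page */
};

/* Set the dirty bit of every page touched by [gpa, gpa + len). */
static void mark_dirty(struct ring_tracker *t, uint64_t gpa, uint32_t len)
{
    if (len == 0)
        return;
    uint64_t first = gpa >> PAGE_SHIFT;
    uint64_t last = (gpa + len - 1) >> PAGE_SHIFT;
    for (uint64_t pfn = first; pfn <= last && pfn < MAX_PAGES; pfn++)
        t->dirty[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

/* Called for each trapped write to the (hypothetical) tail register:
 * descriptors in [old tail, new tail) were just handed to the device. */
static void on_tail_write(struct ring_tracker *t, uint32_t new_tail)
{
    new_tail %= RING_SIZE;
    for (uint32_t i = t->tail; i != new_tail; i = (i + 1) % RING_SIZE)
        mark_dirty(t, t->ring[i].gpa, t->ring[i].len);
    t->tail = new_tail;
}

static int page_is_dirty(const struct ring_tracker *t, uint64_t pfn)
{
    assert(pfn < MAX_PAGES);
    return (t->dirty[pfn / 8] >> (pfn % 8)) & 1;
}
```

The sketch only covers the bookkeeping; decoding the trapped instruction and
forwarding the tail value to hardware, which is the burden raised above, is
left out.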


>   There's
> still no IOMMU Dirty bit available.
>>>>>     (3) centralizing
>>>>> VF critical states retrieving and VF controls into one driver, we propose
>>>>> to introduce mediate ops on top of current vfio-pci device driver.
>>>>>
>>>>>
>>>>>                                       _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>>>>>     __________   register mediate ops|  ___________     ___________    |
>>>>> |          |<-----------------------|     VF    |   |           |
>>>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
>>>>> |__________|----------------------->|   driver  |   |___________|
>>>>>         |            open(pdev)      |  -----------          |         |
>>>>>         |                                                    |
>>>>>         |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
>>>>>        \|/                                                  \|/
>>>>> -----------                                         ------------
>>>>> |    VF   |                                         |    PF    |
>>>>> -----------                                         ------------
>>>>>
>>>>>
>>>>> VF mediate driver could be a standalone driver that does not bind to
>>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
>>>>> extension of PF driver (as in patches 7-9) .
>>>>>
>>>>> Rather than binding directly to the VF, the VF mediate driver registers
>>>>> its mediate ops with vfio-pci at driver init. vfio-pci maintains a list
>>>>> of such mediate ops.
>>>>> (Note that: VF mediate driver can register mediate ops into vfio-pci
>>>>> before vfio-pci binding to any devices. And VF mediate driver can
>>>>> support mediating multiple devices.)
>>>>>
>>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
>>>>> list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
>>>>> device as a parameter.
>>>>> The VF mediate driver should return success or failure depending on
>>>>> whether it supports the pdev.
>>>>> E.g. VF mediate driver would compare its supported VF devfn with the
>>>>> devfn of the passed-in pdev.
>>>>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
>>>>> stop querying other mediate ops and bind the opening device with this
>>>>> mediate ops using the returned mediate handle.
>>>>>
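The open-time lookup described above can be sketched in plain userspace C.
This is only an illustration under stated assumptions: struct fake_pdev
stands in for the kernel's struct pci_dev, and the ops layout, the open()
signature, and the handle encoding are hypothetical rather than taken from
the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct pci_dev; only devfn matters here. */
struct fake_pdev {
    unsigned int devfn;
};

/* Hypothetical, simplified mediate ops: open() returns 0 and fills *handle
 * if the driver supports the device, or a negative value otherwise. */
struct mediate_ops {
    const char *name;
    int (*open)(const struct fake_pdev *pdev, unsigned long *handle);
};

/* What an opened device ends up bound to. */
struct mediate_binding {
    const struct mediate_ops *ops;   /* NULL means plain passthrough */
    unsigned long handle;
};

#define MAX_MEDIATE_OPS 4
static const struct mediate_ops *ops_list[MAX_MEDIATE_OPS];
static size_t nr_ops;

static int register_mediate_ops(const struct mediate_ops *ops)
{
    if (nr_ops == MAX_MEDIATE_OPS)
        return -1;
    ops_list[nr_ops++] = ops;
    return 0;
}

/* Walk the list and bind to the first ops whose open() succeeds. */
static int bind_on_open(const struct fake_pdev *pdev, struct mediate_binding *b)
{
    assert(pdev && b);
    b->ops = NULL;
    b->handle = 0;
    for (size_t i = 0; i < nr_ops; i++) {
        unsigned long handle = 0;
        if (ops_list[i]->open(pdev, &handle) == 0) {
            b->ops = ops_list[i];
            b->handle = handle;
            return 0;
        }
    }
    return -1;
}

/* Sample driver claiming two VFs by devfn; the handle tells them apart. */
static int nic_open(const struct fake_pdev *pdev, unsigned long *handle)
{
    if (pdev->devfn != 16 && pdev->devfn != 17)
        return -1;
    *handle = 0x1000ul + pdev->devfn;
    return 0;
}

/* Sample driver claiming a single device. */
static int gpu_open(const struct fake_pdev *pdev, unsigned long *handle)
{
    if (pdev->devfn != 8)
        return -1;
    *handle = 0x2000ul + pdev->devfn;
    return 0;
}

static const struct mediate_ops nic_ops = { "nic-vf-mediate", nic_open };
static const struct mediate_ops gpu_ops = { "gpu-mediate", gpu_open };
```

With both ops registered, opening devfn 16 and devfn 17 binds each to
nic_ops with a distinct handle, while an unclaimed devfn stays unbound and
would simply be passed through.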
>>>>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
>>>>> VF will be intercepted into VF mediate driver as
>>>>> vfio_pci_mediate_ops->get_region_info(),
>>>>> vfio_pci_mediate_ops->rw,
>>>>> vfio_pci_mediate_ops->mmap, and get customized.
>>>>> For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
>>>>> further return 'pt' to indicate whether vfio-pci should further
>>>>> passthrough data to hw.
>>>>>
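The role of the 'pt' result can be sketched as follows. This is a userspace
model under assumptions, not the patch's implementation: the BAR is a byte
array, the trapped range and the shadow copy are hypothetical, and
mediate_write/vfio_write are illustrative names only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BAR_SIZE 64

struct fake_vf {
    uint8_t hw_bar[BAR_SIZE];    /* stands in for the real device BAR */
    uint8_t shadow[BAR_SIZE];    /* mediate driver's emulated copy */
    size_t trap_start;
    size_t trap_len;             /* 0 means nothing is trapped */
    unsigned int trapped_writes;
};

/* Hypothetical mediate rw hook: handles the access itself inside the
 * trapped range (*pt = 0), otherwise asks the caller to pass it through
 * to hardware (*pt = 1). */
static int mediate_write(struct fake_vf *vf, size_t off, uint8_t val, int *pt)
{
    if (off >= BAR_SIZE)
        return -1;
    if (off >= vf->trap_start && off - vf->trap_start < vf->trap_len) {
        vf->shadow[off] = val;
        vf->trapped_writes++;
        *pt = 0;
    } else {
        *pt = 1;
    }
    return 0;
}

/* Caller side: consult the mediate hook first, then touch hardware only
 * if the hook asked for passthrough. */
static int vfio_write(struct fake_vf *vf, size_t off, uint8_t val)
{
    int pt = 1;
    int ret = mediate_write(vf, off, val, &pt);

    if (ret)
        return ret;
    if (pt)
        vf->hw_bar[off] = val;
    return 0;
}
```

Setting trap_len back to 0 models untrapping: subsequent writes to the same
offsets reach hw_bar again without involving the shadow copy.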
>>>>> When vfio-pci closes the VF, it calls vfio_pci_mediate_ops->release()
>>>>> with the mediate handle as a parameter.
>>>>>
>>>>> The mediate handle returned from vfio_pci_mediate_ops->open() lets the
>>>>> VF mediate driver differentiate two open VFs that share the same device
>>>>> ID and vendor ID.
>>>>>
>>>>> When VF mediate driver exits, it unregisters its mediate ops from
>>>>> vfio-pci.
>>>>>
>>>>>
>>>>> In this patchset, we enable vfio-pci to provide 3 things:
>>>>> (1) calling mediate ops to allow vendor driver customizing default
>>>>> region info/rw/mmap of a region.
>>>>> (2) provide a migration region to support migration
>>>> What's the benefit of introducing a region? It looks to me we don't expect
>>>> the region to be accessed directly from guest. Could we simply extend device
>>>> fd ioctl for doing such things?
>>>>
>>> You may take a look at the mdev live migration discussions in
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-11/msg01763.html
>>>
>>> or previous discussion at
>>> https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg04908.html,
>>> which has a kernel-side implementation: https://patchwork.freedesktop.org/series/56876/
>>>
>>> Generally speaking, the QEMU part of live migration is the same for the
>>> vfio-pci + mediate ops way and the mdev way.
>>
>> So in mdev, do you still have a mediate driver? Or you expect the parent
>> to implement the region?
>>
> No, currently it's only for vfio-pci.

And specific to PCI.

> An mdev parent driver is free to customize its regions and hence does not
> require these mediate ops hooks.
>
>>> The region is only a channel for
>>> QEMU and kernel to communicate information without introducing IOCTLs.
>>
>> Well, at least you introduce a new type of region in the uapi. So this
>> does not answer why a region is better than an ioctl. If the region will
>> only be used by QEMU, using an ioctl is much easier and more straightforward.
>>
> It's not introduced by me :)
> mdev live migration actually works this way; I'm just staying
> compatible with the uapi.


I meant e.g VFIO_REGION_TYPE_MIGRATION.


>
>  From my own perspective, my answer is that a region is more flexible
> compared to ioctl. vendor driver can freely define the size,
>

Probably not since it's an ABI I think.

>   mmap cap of
> its data subregion.
>

It doesn't help much unless it can be mapped into the guest (which I don't
think is the case here).

>   Also, there're already too many ioctls in vfio.

Probably not :) We have a bunch of subsystems that have many more
ioctls than VFIO (e.g. DRM).

>>>
>>>>> (3) provide a dynamic trap bar info region to allow vendor driver
>>>>> control trap/untrap of device pci bars
>>>>>
>>>>> This vfio-pci + mediate ops way differs from mdev way in that
>>>>> (1) mdev way needs to create a 1:1 mdev device on top of one VF, device
>>>>> specific mdev parent driver is bound to VF directly.
>>>>> (2) vfio-pci + mediate ops way does not create mdev devices and VF
>>>>> mediate driver does not bind to VFs. Instead, vfio-pci binds to VFs.
>>>>>
>>>>> The reason why we don't choose the way of writing mdev parent driver is
>>>>> that
>>>>> (1) VFs are almost all the time directly passthroughed. Directly binding
>>>>> to vfio-pci can make most of the code shared/reused.
>>>> Can we split out the common parts from vfio-pci?
>>>>
>>> That's very attractive, but one cannot implement a vfio-pci without
>>> exporting everything in it as the common part :)
>>
>> Well, I think it should not be hard to do that. E.g. you can route it
>> back like:
>>
>> vfio -> vfio_mdev -> parent -> vfio_pci
>>
> it's desired for us to have mediate driver binding to PF device.
> so once a VF device is created, only PF driver and vfio-pci are
> required. Just the same as what needs to be done for a normal VF passthrough.
> otherwise, a separate parent driver binding to VF is required.
> Also, this parent driver has many drawbacks, as I mentioned in this
> cover letter.

Well, as discussed, there is no need to duplicate the code; the BAR trick
should still work. The main issues I see with this proposal are:

1) PCI specific, other bus may need something similar
2) Function duplicated with mdev and mdev can do even more


>>>>>     If we write a
>>>>> vendor specific mdev parent driver, most of the code (like passthrough
>>>>> style of rw/mmap) still needs to be copied from vfio-pci driver, which is
>>>>> actually a duplicated and tedious work.
>>>> The mediate ops looks quite similar to what vfio-mdev did. And it looks to
>>>> me we need to consider live migration for mdev as well. In that case, do we
>>>> still expect mediate ops through VFIO directly?
>>>>
>>>>
>>>>> (2) For features like dynamically trap/untrap pci bars, if they are in
>>>>> vfio-pci, they can be available to most people without repeated code
>>>>> copying and re-testing.
>>>>> (3) with a 1:1 mdev driver which passthrough VFs most of the time, people
>>>>> have to decide whether to bind VFs to vfio-pci or mdev parent driver before
>>>>> it runs into a real migration need. However, if vfio-pci is bound
>>>>> initially, they have no chance to do live migration when there's a need
>>>>> later.
>>>> We can teach management layer to do this.
>>>>
>>> No. not possible as vfio-pci by default has no migration region and
>>> dirty page tracking needs vendor's mediation at least for most
>>> passthrough devices now.
>>
>> I'm not quite sure I get it, but in this case, why not just teach them to
>> use the driver that has migration support?
>>
> That's a way, but as more and more passthrough devices have demands and
> caps to do migration, will vfio-pci be used in future any more ?


This should not be a problem:
- If we introduce a common mdev for vfio-pci, we can just bind that 
driver always
- The most straightforward way to support dirty page tracking is done by 
IOMMU instead of device specific operations.

Thanks

>
> Thanks
> Yan
>
>> Thanks
>>
>>
>>> Thanks
>>> Yan
>>>
>>>> Thanks
>>>>
>>>>
>>>>> In this patchset,
>>>>> - patches 1-4 enable vfio-pci to call mediate ops registered by vendor
>>>>>      driver to mediate/customize region info/rw/mmap.
>>>>>
>>>>> - patches 5-6 provide a standalone sample driver to register a mediate ops
>>>>>      for Intel Graphics Devices. It does not bind to IGDs directly but decides
>>>>>      what devices it supports via its pciidlist. It also demonstrates how to
>>>>>      dynamically trap a device's PCI bars. (by adding more pciids in its
>>>>>      pciidlist, this sample driver actually is not necessarily limited to
>>>>>      support IGDs)
>>>>>
>>>>> - patches 7-9 provide a sample on i40e driver that supports Intel(R)
>>>>>      Ethernet Controller XL710 Family of devices. It supports VF precopy live
>>>>>      migration on Intel's 710 SRIOV. (but we commented out the real
>>>>>      implementation of dirty page tracking and device state retrieving part
>>>>>      to focus on demonstrating framework part. Will send out them in future
>>>>>      versions)
>>>>>      patch 7 registers/unregisters VF mediate ops when PF driver
>>>>>      probes/removes. It specifies its supporting VFs via
>>>>>      vfio_pci_mediate_ops->open(pdev)
>>>>>
>>>>>      patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
>>>>>      provides a sample implementation of migration region.
>>>>>      The QEMU part of vfio migration is based on v8
>>>>>      https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
>>>>>      We did not base it on the recent v9 because we think there are still
>>>>>      open issues in the dirty page tracking part of that series.
>>>>>
>>>>>      patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
>>>>>      provides an example on how to trap part of bar0 when migration starts
>>>>>      and passthrough this part of bar0 again when migration fails.
>>>>>
>>>>> Yan Zhao (9):
>>>>>      vfio/pci: introduce mediate ops to intercept vfio-pci ops
>>>>>      vfio/pci: test existence before calling region->ops
>>>>>      vfio/pci: register a default migration region
>>>>>      vfio-pci: register default dynamic-trap-bar-info region
>>>>>      samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
>>>>>      sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
>>>>>      i40e/vf_migration: register mediate_ops to vfio-pci
>>>>>      i40e/vf_migration: mediate migration region
>>>>>      i40e/vf_migration: support dynamic trap of bar0
>>>>>
>>>>>     drivers/net/ethernet/intel/Kconfig            |   2 +-
>>>>>     drivers/net/ethernet/intel/i40e/Makefile      |   3 +-
>>>>>     drivers/net/ethernet/intel/i40e/i40e.h        |   2 +
>>>>>     drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>>>>>     .../ethernet/intel/i40e/i40e_vf_migration.c   | 626 ++++++++++++++++++
>>>>>     .../ethernet/intel/i40e/i40e_vf_migration.h   |  78 +++
>>>>>     drivers/vfio/pci/vfio_pci.c                   | 189 +++++-
>>>>>     drivers/vfio/pci/vfio_pci_private.h           |   2 +
>>>>>     include/linux/vfio.h                          |  18 +
>>>>>     include/uapi/linux/vfio.h                     | 160 +++++
>>>>>     samples/Kconfig                               |   6 +
>>>>>     samples/Makefile                              |   1 +
>>>>>     samples/vfio-pci/Makefile                     |   2 +
>>>>>     samples/vfio-pci/igd_dt.c                     | 367 ++++++++++
>>>>>     14 files changed, 1455 insertions(+), 4 deletions(-)
>>>>>     create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
>>>>>     create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
>>>>>     create mode 100644 samples/vfio-pci/Makefile
>>>>>     create mode 100644 samples/vfio-pci/igd_dt.c
>>>>>



