[Virtio-fs] [PATCH v2 2/4] vhost-user: Interface for migration state transfer

Hao Xu hao.xu at linux.dev
Fri Jul 21 08:23:23 UTC 2023


On 7/21/23 16:07, Hanna Czenczek wrote:
> On 20.07.23 17:05, Hao Xu wrote:
>>
>> On 7/20/23 21:20, Hanna Czenczek wrote:
>>> On 20.07.23 14:13, Hao Xu wrote:
>>>>
>>>> On 7/12/23 19:17, Hanna Czenczek wrote:
>>>>> Add the interface for transferring the back-end's state during 
>>>>> migration
>>>>> as defined previously in vhost-user.rst.
>>>>>
>>>>> Signed-off-by: Hanna Czenczek <hreitz at redhat.com>
>>>>> ---
>>>>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>>   hw/virtio/vhost.c                 |  37 ++++++++
>>>>>   4 files changed, 287 insertions(+)
>>>>>
>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>> index 31a251a9f5..e59d0b53f8 100644
>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>>   } VhostSetConfigType;
>>>>>   +typedef enum VhostDeviceStateDirection {
>>>>> +    /* Transfer state from back-end (device) to front-end */
>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>>> +    /* Transfer state from front-end to back-end (device) */
>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>>> +} VhostDeviceStateDirection;
>>>>> +
>>>>> +typedef enum VhostDeviceStatePhase {
>>>>> +    /* The device (and all its vrings) is stopped */
>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>>> +} VhostDeviceStatePhase;
>>>>> +
>>>>>   struct vhost_inflight;
>>>>>   struct vhost_dev;
>>>>>   struct vhost_log;
>>>>> @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
>>>>>     typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
>>>>>   +typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
>>>>> +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
>>>>> +                                            VhostDeviceStateDirection direction,
>>>>> +                                            VhostDeviceStatePhase phase,
>>>>> +                                            int fd,
>>>>> +                                            int *reply_fd,
>>>>> +                                            Error **errp);
>>>>> +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
>>>>> +
>>>>>   typedef struct VhostOps {
>>>>>       VhostBackendType backend_type;
>>>>>       vhost_backend_init vhost_backend_init;
>>>>> @@ -181,6 +202,9 @@ typedef struct VhostOps {
>>>>>       vhost_force_iommu_op vhost_force_iommu;
>>>>>       vhost_set_config_call_op vhost_set_config_call;
>>>>>       vhost_reset_status_op vhost_reset_status;
>>>>> +    vhost_supports_migratory_state_op vhost_supports_migratory_state;
>>>>> +    vhost_set_device_state_fd_op vhost_set_device_state_fd;
>>>>> +    vhost_check_device_state_op vhost_check_device_state;
>>>>>   } VhostOps;
>>>>>     int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
>>>>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>>>>> index 69bf59d630..d8877496e5 100644
>>>>> --- a/include/hw/virtio/vhost.h
>>>>> +++ b/include/hw/virtio/vhost.h
>>>>> @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
>>>>>   int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>>>>>                              struct vhost_inflight *inflight);
>>>>>   bool vhost_dev_has_iommu(struct vhost_dev *dev);
>>>>> +
>>>>> +/**
>>>>> + * vhost_supports_migratory_state(): Checks whether the back-end
>>>>> + * supports transferring internal state for the purpose of migration.
>>>>> + * Support for this feature is required for vhost_set_device_state_fd()
>>>>> + * and vhost_check_device_state().
>>>>> + *
>>>>> + * @dev: The vhost device
>>>>> + *
>>>>> + * Returns true if the device supports these commands, and false if it
>>>>> + * does not.
>>>>> + */
>>>>> +bool vhost_supports_migratory_state(struct vhost_dev *dev);
>>>>> +
>>>>> +/**
>>>>> + * vhost_set_device_state_fd(): Begin transfer of internal state from/to
>>>>> + * the back-end for the purpose of migration.  Data is to be transferred
>>>>> + * over a pipe according to @direction and @phase.  The sending end must
>>>>> + * only write to the pipe, and the receiving end must only read from it.
>>>>> + * Once the sending end is done, it closes its FD.  The receiving end
>>>>> + * must take this as the end-of-transfer signal and close its FD, too.
>>>>> + *
>>>>> + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
>>>>> + * read FD for LOAD.  This function transfers ownership of @fd to the
>>>>> + * back-end, i.e. closes it in the front-end.
>>>>> + *
>>>>> + * The back-end may optionally reply with an FD of its own, if this
>>>>> + * improves efficiency on its end.  In this case, the returned FD is
>>>>
>>>>
>>>> Hi Hanna,
>>>>
>>>> In what case/situation would the back-end have a more efficient FD?
>>>
>>> Hi Hao,
>>>
>>> There is no example yet.
>>>
>>>> My understanding of this "FD of its own" is that it is of the same type
>>>> as the given FD (e.g. both are pipe FDs), so why would an FD from the
>>>> back-end make a difference?  Am I missing anything here?
>>>
>>> Maybe it makes more sense in the context of how we came up with the 
>>> idea: Specifically, Stefan and I were asking which end should 
>>> provide the FD.  In the context of vhost-user, it makes sense to 
>>> have it be the front-end, because it controls vhost-user 
>>> communication, but that’s just the natural protocol choice, not 
>>> necessarily the most efficient one.
>>>
>>> It is imaginable that the front-end (e.g. qemu) could create a file 
>>> descriptor whose data is directly spliced (automatically) into the 
>>> migration stream, and then hand this FD to the back-end. In 
>>> practice, this doesn’t work for qemu (at this point), because it 
>>> doesn’t use simple read/write into the migration stream, but has an 
>>> abstraction layer for that.  It might be possible to make this work 
>>> in some cases, depending on what is used as a transport, but (1) not 
>>> generally, and (2) not now.  But this would be efficient.
>>
>>
>> One thing I'm wondering: we already have a channel (a Unix domain
>> socket) between the front-end and the back-end, so why not deliver the
>> state file (as an FD) to/from the back-end directly, rather than
>> negotiating a new pipe as the data channel?  (Since we already assume
>> they can share a pipe FD, why not the FD of a regular file?)
>
> I don’t quite understand.  We do deliver the FD to the back-end through 
> this new command, quite directly.  The back-end can disagree and send 
> another FD back, but it doesn’t have to.  It can simply always take the 
> FD it has been given.
>
> The FD also doesn’t need to be a pipe, it can be any FD that the 
> back-end can read the state from (see my discussion with Stefan on 
> patch 1).  In the practical implementation, it’s difficult for qemu to 
> provide anything that is not a pipe, because the migration state can 
> be sent over many channels, and the respective FD (if any) is not 
> exposed on the high level (for devices).
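
I see, thanks.  Just to spell out my understanding of the front-end flow
for SAVE with this interface, a rough sketch based on the declarations in
this patch (I'm assuming the public vhost_set_device_state_fd() takes the
same arguments as the vhost_set_device_state_fd_op backend op, and
read_all_into_migration_stream() is only a made-up placeholder; error
handling trimmed):

int pipe_fds[2], reply_fd = -1, read_fd;
Error *local_err = NULL;

if (!vhost_supports_migratory_state(dev)) {
    return -ENOTSUP;
}
if (pipe(pipe_fds) < 0) {
    return -errno;
}

/* Hand the write end to the back-end; ownership of that FD moves with it */
if (vhost_set_device_state_fd(dev, VHOST_TRANSFER_STATE_DIRECTION_SAVE,
                              VHOST_TRANSFER_STATE_PHASE_STOPPED,
                              pipe_fds[1], &reply_fd, &local_err) < 0) {
    close(pipe_fds[0]);
    return -1;
}

/* If the back-end replied with an FD of its own, read from that one instead */
if (reply_fd >= 0) {
    close(pipe_fds[0]);
    read_fd = reply_fd;
} else {
    read_fd = pipe_fds[0];
}

read_all_into_migration_stream(read_fd);   /* read until EOF */
close(read_fd);

Is that about right?
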
>
>>
>>>
>>> The model we’d implement in qemu with this series is comparatively 
>>> not efficient, because it manually copies data from the FD (which by 
>>> default is a pipe) into the migration stream.
>>>
>>> But it is possible that the front-end can provide a zero-copy FD 
>>> into the migration stream, and for that reason we want to allow the 
>>> front-end to provide the transfer FD.
>>>
>>> In contrast, the back-end might have a more efficient implementation 
>>> on its side, too, though.  It is difficult to imagine, but it may be 
>>> possible that it already has an FD that the data needs to be written to 
>>> or read from, e.g. because it’s connected to a physical device that 
>>> wants to get/send its state this way.  Admittedly, we have 
>>> absolutely no concrete example for such a back-end at this point, 
>>> but it’s hard to rule out that it is possible that there will be 
>>> back-ends that could make use of zero-copy if only they are allowed 
>>> to dictate the transfer FD.
>>>
>>> So because we in qemu can’t (at least not generally) provide an 
>>> efficient (zero-copy) implementation, we don’t want to rule out that 
>>> the back-end might be able to, so we also want to allow it to 
>>> provide the transfer FD.
>>>
>>> In the end, we decided that we don’t want to preclude either side from 
>>> providing the FD.  If one side knows it can do better than a plain 
>>> pipe with copying on both ends, it should provide the FD. That 
>>> doesn’t complicate the implementation much.
>>>
>>> (So, notably, we measure “improves efficiency” based on “is it 
>>> better than a plain pipe with copying on both ends”.  A pipe with 
>>> copying is the default implementation, but as Stefan has pointed out 
>>> in his review, it doesn’t need to be a pipe.  More efficient FDs, 
>>> like the back-end can provide in its reply, would actually likely 
>>> not be pipes.)
>>
>>
>> Yeah, but the vhost-user protocol in this patchset defines these FDs as
>> a data transfer channel rather than the data itself; what about the latter?
>
> I don’t quite understand this either.  Perhaps you’re suggesting that 
> we should just transfer the data over the vhost-user socket. There are 
> two reasons why I don’t think we should do that: First, the vhost-user 
> socket is currently only used for a small amount of data.  For 
> example, the Rust vhost crate will reject any message longer than 4096 
> bytes.  So if I were to use it for more data, I feel like I’d be 
> abusing it.  Second, if/when we add other transfer phases (i.e. while 
> the VM is still running, to transfer larger amounts of state), it will 
> be important that the vhost-user socket is open for messages from the 
> front-end at any time.  So then, at the latest, we would need a 
> separate data channel.  I don’t feel like there is harm in making the 
> interface have this separate channel now already.
>
> Or maybe you suggest that the separate FD is OK, but it shouldn’t be 
> defined to be a data transfer channel, but just data, i.e. not 
> necessarily a pipe.  I think that’s basically the discussion I had 
> with Stefan on patch 1: That the important thing is just that the 
> receiving end can read data out of the FD until EOF, and the writing 
> end can write data and close the FD to signal the end of data. Whether 
> they actually do that is not important, only that anything they do 
> will let the other end do that.


I re-read the discussion of patch 1, and I now see that what I suggested is
already included in the protocol design.
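
And to double-check my reading of the end-of-transfer semantics on the
back-end (SAVE) side, roughly this (serialize_device_state() is just a
placeholder for however a device serializes its own state):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

static int backend_send_state(int fd)
{
    const uint8_t *buf;
    size_t len;

    /* placeholder: produce the device's serialized state */
    serialize_device_state(&buf, &len);

    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            close(fd);
            return -errno;
        }
        buf += n;
        len -= n;
    }

    /* closing the write end is the end-of-transfer signal (EOF for the reader) */
    close(fd);
    return 0;
}
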


>
> Hanna
>


