[Libguestfs] [PATCH] v2v: rhv-upload-plugin: Fix waiting for finalize

Vojtech Juranek vjuranek at redhat.com
Fri Jul 16 08:18:24 UTC 2021


One issue in documentation, otherwise LGTM

On Monday, 12 July 2021 01:58:52 CEST Nir Soffer wrote:
> Waiting for image transfer finalize is complex. In the past we tried to
> simplify the process by waiting on the disk status, but turns out that
> due to the way oVirt lock the disk, this is not reliable.
> 
> These is finalize success flow:
> 
> 1. User asks to finalize the transfer
> 2. oVirt sets transfer phase to FINALIZING_SUCCESS
> 3. oVirt finalizes the transfer
> 4. oVirt sets disk status to OK
> 5. oVirt unlocks the disk and changes transfer phase to FINISHED_SUCCESS
> 6. oVirt removes the transfer
> 
> In oVirt logs we can see that disk status changes to OK about 3 seconds
> before the disk is actually unlocked. This is a very old problem that is
> unlikely to be fixed.
> 
> The only way to wait for transfer completion is to poll the transfer
> phase, but oVirt makes this hard by removing the transfer shortly after
> it completes, so we may not be able to get the FINISHED_SUCCESS phase.
> If the transfer was removed before we got one of the final phases, we
> need to check the disk status to understand the result of transfer.
> 
> oVirt 4.4.7 made polling transfer phase easier by keeping the transfer
> after completion, but we need to support older versions so we must have
> generic code that work with any version.
> 
> To make debugging easier, we log the transfer phase during polling. Here
> is a typical transfer log when finalizing transfer:
> 
>     finalizing transfer 59e545f3-db1f-4a6b-90b1-80ac66572faa
>     transfer 59e545f3-db1f-4a6b-90b1-80ac66572faa is finalizing_success
>     transfer 59e545f3-db1f-4a6b-90b1-80ac66572faa is finalizing_success
>     transfer 59e545f3-db1f-4a6b-90b1-80ac66572faa is finalizing_success
>     transfer 59e545f3-db1f-4a6b-90b1-80ac66572faa is finalizing_success
>     transfer 59e545f3-db1f-4a6b-90b1-80ac66572faa is finished_success
>     transfer 59e545f3-db1f-4a6b-90b1-80ac66572faa finalized in 5.153 seconds
> 
> Signed-off-by: Nir Soffer <nsoffer at redhat.com>
> ---
>  v2v/rhv-upload-plugin.py | 102 +++++++++++++++++++++++++++------------
>  1 file changed, 71 insertions(+), 31 deletions(-)
> 
> diff --git a/v2v/rhv-upload-plugin.py b/v2v/rhv-upload-plugin.py
> index 07e879c9..11050358 100644
> --- a/v2v/rhv-upload-plugin.py
> +++ b/v2v/rhv-upload-plugin.py
> @@ -607,17 +607,29 @@ def finalize_transfer(connection, transfer, disk_id):
>      """
>      Finalize a transfer, making the transfer disk available.
> 
> -    If finalizing succeeds, transfer's phase will change to
> FINISHED_SUCCESS -    and the transer's disk status will change to OK.  On
> errors, the transfer's -    phase will change to FINISHED_FAILURE and the
> disk status will change to -    ILLEGAL and it will be removed. In both
> cases the transfer entity will be -    removed shortly after.
> -
> -    If oVirt fails to finalize the transfer, transfer's phase will change
> to -    PAUSED_SYSTEM. In this case the disk's status will change to
> ILLEGAL and it -    will not be removed.
> -
> -    For simplicity, we track only disk's status changes.
> +    If finalizing succeeds, the transfer's disk status will change to OK
> +    and transfer's phase will change to FINISHED_SUCCESS. Unfortunately,
> +    the disk status is modified before the transfer finishes, and oVirt
> +    may still hold a lock on the disk at this point.
> +
> +    The only way to make sure that the disk is unlocked, is to wait
> +    until the transfer phase switches FINISHED_SUCCESS. Unfortunately
> +    oVirt makes this hard to use because the transfer is removed shortly
> +    after switching the phase to the final phase. However if the
> +    transfer was removed, we can be sure that the disk is not unlocked,

should be "disk is not locked"

> +    since oVirt releases the locks before removing the transfer.
> +
> +    On errors, the transfer's phase will change to FINISHED_FAILURE and
> +    the disk status will change to ILLEGAL and it will be removed. Again
> +    the transfer will be removed shortly after that.
> +
> +    If oVirt fails to finalize the transfer, transfer's phase will
> +    change to PAUSED_SYSTEM. In this case the disk's status will change
> +    to ILLEGAL and it will not be removed.
> +
> +    oVirt 4.4.7 made waiting for transfer easier by keeping transfers
> +    after they complete, but we must support older versions so we have
> +    generic code that work with any version.
> 
>      For more info see:
>      -
> http://ovirt.github.io/ovirt-engine-api-model/4.4/#services/image_transfer
> @@ -632,34 +644,62 @@ def finalize_transfer(connection, transfer, disk_id):
> 
>      transfer_service.finalize()
> 
> -    disk_service = (connection.system_service()
> -                    .disks_service()
> -                    .disk_service(disk_id))
> -
>      while True:
>          time.sleep(1)
>          try:
> -            disk = disk_service.get()
> +            transfer = transfer_service.get()
>          except sdk.NotFoundError:
> -            # Disk verification failed and the system removed the disk.
> -            raise RuntimeError(
> -                "transfer %s failed: disk %s was removed"
> -                % (transfer.id, disk_id))
> +            # Transfer was removed (ovirt < 4.4.7). We need to check the
> +            # disk status to understand if the transfer was successful.
> +            # Due to the way oVirt does locking, we know that the disk
> +            # is unlocked at this point so we can check only once.
> 
> -        if disk.status == types.DiskStatus.ILLEGAL:
> -            # Disk verification failed or transfer was paused by the
> system. -            raise RuntimeError(
> -                "transfer %s failed: disk is ILLEGAL" % transfer.id)
> +            debug("transfer %s was removed, checking disk %s status"
> +                  % (transfer.id, disk_id))
> 
> -        if disk.status == types.DiskStatus.OK:
> -            debug("transfer %s finalized in %.3f seconds"
> -                  % (transfer.id, time.time() - start))
> -            break
> +            disk_service = (connection.system_service()
> +                            .disks_service()
> +                            .disk_service(disk_id))
> +
> +            try:
> +                disk = disk_service.get()
> +            except sdk.NotFoundError:
> +                raise RuntimeError(
> +                    "transfer %s failed: disk %s was removed"
> +                    % (transfer.id, disk_id))
> +
> +            debug("disk %s is %s" % (disk_id, disk.status))
> +
> +            if disk.status == types.DiskStatus.OK:
> +                break
> 
> -        if time.time() > start + timeout:
>              raise RuntimeError(
> -                "timed out waiting for transfer %s to finalize"
> -                % transfer.id)
> +                "transfer %s failed: disk is %s" % (transfer.id,
> disk.status)) +        else:
> +            # Transfer exists, check if it reached one of the final
> +            # phases, or we timed out.
> +
> +            debug("transfer %s is %s" % (transfer.id, transfer.phase))
> +
> +            if transfer.phase == types.ImageTransferPhase.FINISHED_SUCCESS:
> +                break
> +
> +            if transfer.phase == types.ImageTransferPhase.FINISHED_FAILURE:
> +                raise RuntimeError(
> +                    "transfer %s has failed" % (transfer.id,))
> +
> +            if transfer.phase == types.ImageTransferPhase.PAUSED_SYSTEM:
> +                raise RuntimeError(
> +                    "transfer %s was paused by system" % (transfer.id,))
> +
> +            if time.time() > start + timeout:
> +                raise RuntimeError(
> +                    "timed out waiting for transfer %s to finalize, "
> +                    "transfer is %s"
> +                    % (transfer.id, transfer.phase))
> +
> +    debug("transfer %s finalized in %.3f seconds"
> +          % (transfer.id, time.time() - start))
> 
> 
>  def transfer_supports_format():

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: This is a digitally signed message part.
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20210716/e72a8093/attachment.sig>


More information about the Libguestfs mailing list