Snapshot operation aborted and volume usage

Peter Krempa pkrempa at redhat.com
Thu Mar 11 13:24:40 UTC 2021


On Thu, Mar 11, 2021 at 10:51:13 +0200, Liran Rotenberg wrote:
> We recently had this bug[1]. The thought that came from it concerns the
> handling of the error code after running virDomainSnapshotCreateXML; we
> encountered VIR_ERR_OPERATION_ABORTED (78).

VIR_ERR_OPERATION_ABORTED is an error code emitted only by the
migration code. That means the error comes from a failure to take the
memory image/snapshot of the VM.

A quick skim through the bug report mentions a timeout, so your code
probably aborted the snapshot because it was taking too long.
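
For illustration, a minimal sketch (Python, via the libvirt bindings;
the connection, domain name and snapshot XML are only illustrative
assumptions) of telling this abort apart from other failures:

import libvirt

def take_snapshot(conn, dom_name, snapshot_xml, flags=0):
    dom = conn.lookupByName(dom_name)
    try:
        return dom.snapshotCreateXML(snapshot_xml, flags)
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_OPERATION_ABORTED:
            # The memory part of the snapshot was aborted (e.g. by a
            # caller-side timeout); per the above, the disk overlays
            # were not installed into the backing chain.
            print("snapshot aborted:", e)
        else:
            print("snapshot failed:", e)
        raise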

> Apparently, the new volume is in use. Are there cases where this will
> happen and the new volume won't appear in the volumes chain? Can we detect
> / know when?

In the vast majority of cases, if virDomainSnapshotCreateXML returns
failure, the new disk volumes are NOT in use at that point.

Libvirt tries very hard to ensure that everything is atomic. The memory
snapshot is taken before installing volumes into the backing chain, so
if that one fails we don't even attempt to do anything with the disks.
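
So the usual cleanup after such a failure is simply to remove the
pre-created overlay volumes. A sketch, assuming the overlays were
created as libvirt storage volumes in a known pool (the pool and
volume names are illustrative, not something from the report above):

import libvirt

def cleanup_unused_overlays(conn, pool_name, overlay_names):
    # Delete overlay volumes that were pre-created for the snapshot but
    # never installed into the backing chain (the common failure case).
    pool = conn.storagePoolLookupByName(pool_name)
    for name in overlay_names:
        try:
            pool.storageVolLookupByName(name).delete(0)
        except libvirt.libvirtError as e:
            print("could not delete %s: %s" % (name, e))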

There are three extremely unlikely cases in which the snapshot API
returns failure but the new images were already installed into the
backing chain:

1) resuming of the VM failed after snapshot
2) thawing (domfsthaw) of filesystems has failed
    (easily avoided by not using the _QUIESCE flag and instead
    freezing/thawing manually; see the sketch below)
3) saving of the internal VM state XML failed

Any error except those above can happen only if the images weren't
installed or the VM died while installing the images.
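
For case 2, a sketch of the manual freeze/thaw approach (Python
bindings; the error handling shown is an illustrative assumption),
which keeps a thaw failure from being reported as a snapshot failure:

import libvirt

def snapshot_with_manual_freeze(dom, snapshot_xml, flags=0):
    # Freeze guest filesystems via the guest agent ourselves instead of
    # passing VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE, so a failed thaw does
    # not turn an otherwise successful snapshot into an API failure.
    dom.fsFreeze()
    try:
        return dom.snapshotCreateXML(snapshot_xml, flags)
    finally:
        try:
            dom.fsThaw()
        except libvirt.libvirtError as e:
            # The snapshot (if it got that far) is still valid; handle
            # the thaw failure separately instead of rolling back.
            print("thaw failed:", e)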

In addition, if resuming the CPUs after the snapshot fails, the CPUs
never ran, so the guest couldn't have written anything to the images.
Since the snapshot is supposed to flush qemu's caches, if you destroy
the VM without ever running the vCPUs it's safe to discard the
overlays, as the guest hasn't written anything into them yet.

> Thinking aloud, if we can detect such cases we can prevent rolling back by
> reporting it back from VDSM to ovirt. Or, if it can't be detected, to err on
> the safe side in order to avoid data corruption and prevent the rollback as
> well.

In general, except for the case when saving of the guest XML has
failed, the new disk images will not be used by the VM, so it's safe
to delete them.

> Currently, in ovirt, if the job is aborted, we will look into the chain to
> decide whether to rollback or not.

This is okay; we update the XML only if qemu successfully installed
the overlays.
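
One way to do that check against the live domain XML (a sketch; it
assumes file-backed disks and a known overlay path, both of which are
illustrative assumptions):

import xml.etree.ElementTree as ET

def overlay_is_active(dom, overlay_path):
    # Return True if overlay_path is the active source of any disk in
    # the domain XML, i.e. qemu installed it on top of the chain and a
    # rollback must not simply delete it.
    root = ET.fromstring(dom.XMLDesc(0))
    for disk in root.findall("./devices/disk"):
        source = disk.find("source")
        if source is not None and source.get("file") == overlay_path:
            return True
    return False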



