[libvirt] [PATCH v2 3/9] backup: Document nuances between different state capture APIs

Eric Blake eblake at redhat.com
Fri Oct 12 20:25:05 UTC 2018


On 10/12/18 3:45 AM, Peter Krempa wrote:
> On Fri, Oct 12, 2018 at 00:10:05 -0500, Eric Blake wrote:
>> Upcoming patches will add support for incremental backups via
>> a new API; but first, we need a landing page that gives an
>> overview of capturing various pieces of guest state, and which
>> APIs are best suited to which tasks.
>>
>> Signed-off-by: Eric Blake <eblake at redhat.com>
>>

>> +
>> +    <h2><a id="definitions">State capture trade-offs</a></h2>
>> +
>> +    <p>One of the features made possible with virtual machines is live
>> +      migration -- transferring all state related to the guest from
>> +      one host to another with minimal interruption to the guest's
>> +      activity. In this case, state includes domain memory (including
>> +      register and device contents), and domain storage (whether the
>> +      guest's view of the disks are backed by local storage on the
>> +      host, or by the hypervisor accessing shared storage over a
>> +      network).  A clever observer will then note that if all state is
>> +      available for live migration, then there is nothing stopping a
>> +      user from saving some or all of that state at a given point of
>> +      time in order to be able to later rewind guest execution back to
>> +      the state it previously had. The astute reader will also realize
>> +      that state capture at any level requires that the data must be
>> +      stored and managed by some mechanism. This processing might fit
>> +      in a single file, or more likely require a chain of related
>> +      files, and may require synchronization with third-party tools

I'll update this wording to 'array of files', matching your later comment.

>> +      built around managing the amount of data resulting from
>> +      capturing the state of multiple guests that each use multiple
>> +      disks.
>> +    </p>
>> +
>> +    <p>
>> +      There are several libvirt APIs associated with capturing the
>> +      state of a guest, which can later be used to rewind that guest
>> +      to the conditions it was in earlier.  The following is a list of
>> +      trade-offs and differences between the various facets that
>> +      affect capturing domain state for active domains:
>> +    </p>
>> +
>> +    <dl>
>> +      <dt>Duration</dt>
>> +      <dd>Capturing state can be a lengthy process, so while the
>> +        captured state ideally represents an atomic point in time
>> +        corresponding to something the guest was actually executing,
>> +        capturing state tends to focus on minimizing guest downtime
>> +        while performing the rest of the state capture in parallel
>> +        with guest execution.  Some interfaces require up-front
>> +        preparation (the state captured is not complete until the API
>> +        ends, which may be some time after the command was first
>> +        started), while other interfaces track the state when the
>> +        command was first issued, regardless of the time spent in
>> +        capturing the rest of the state.  Also, time spent in state
>> +        capture may be longer than the time required for live
>> +        migration, when state must be duplicated rather than shared.
>> +      </dd>
>> +
>> +      <dt>Amount of state</dt>
>> +      <dd>For an online guest, there is a choice between capturing the
>> +        guest's memory (all that is needed during live migration when
>> +        the storage is already shared between source and destination),
>> +        the guest's disk state (all that is needed if there are no
>> +        pending guest I/O transactions that would be lost without the
>> +        corresponding memory state), or both together.  Reverting to
>> +        partial state may still be viable, but typically, booting from
>> +        captured disk state without corresponding memory is comparable
>> +        to rebooting a machine that had power cut before I/O could be
>> +        flushed. Guests may need to use proper journaling methods to
>> +        avoid problems when booting from partial state.
> 
> While reverting disks without memory is what most systems are somewhat
> designed to handle, reverting memory state without disk state is a
> recipe for disaster.

I'll try to word that in there.

> 
>> +      </dd>
>> +
>> +      <dt>Quiescing of data</dt>
>> +      <dd>Even if a guest has no pending I/O, capturing disk state may
>> +        catch the guest at a time when the contents of the disk are
>> +        inconsistent. Cooperating with the guest to perform data
>> +        quiescing is an optional step to ensure that captured disk
>> +        state is fully consistent without requiring additional memory
>> +        state, rather than just crash-consistent.  But guest
>> +        cooperation may also have time constraints, where the guest
>> +        can rightfully panic if there is too much downtime while I/O
>> +        is frozen.
> 
> Also that still does not guarantee that the filesystem is fully
> consistent.

I guess your argument is that depending on how the guest implements its 
response to the quiesce request, it might still leave the filesystem 
inconsistent (but presumably in that case the guest is prepared to deal 
with such inconsistency, since it is already cooperating enough to 
advertise that it can reach a state of minimal pending I/O).  Is it 
worth adding your sentence or some other explanation, or can this 
paragraph stand as is?

> 
>> +      </dd>
>> +
>> +      <dt>Quantity of files</dt>
>> +      <dd>When capturing state, some approaches store all state within
>> +        the same file (internal), while others expand a chain of
>> +        related files that must be used together (external), for more
> 
> It's more of an array of files. We use the term chain for the disk image
> chain mostly.

Good wording improvement (here, and above).

> 
>> +        files that a management application must track.
>> +      </dd>
>> +
>> +      <dt>Impact to guest definition</dt>
>> +      <dd>Capturing state may require temporary changes to the guest
>> +        definition, such as associating new files into the domain
>> +        definition. While state capture should never impact the
>> +        running guest, a change to the domain's active XML may have
>> +        impact on other host operations being performed on the domain.
> 
> External snapshots make permanent changes.

Well, they are changes to the permanent XML, but temporary in the sense 
that you can later use block commit to get back to the pre-snapshot state.

> 
> Also I'm not quite sure what you mean by impacting other operations.

Maybe:

Capturing state may require changes to the guest definition, such as 
associating new files into the domain definition. While such changes 
should never impact the running guest, the fact that the domain's active 
XML has changed may have implications for other operations on the domain, 
particularly for a management application that tracks things via 
transient domains.

> 
>> +      </dd>
>> +
>> +      <dt>Third-party integration</dt>
>> +      <dd>When capturing state, there are tradeoffs to how much of the
>> +        process must be done directly by the hypervisor, and how much
>> +        can be off-loaded to third-party software.  Since capturing
>> +        state is not instantaneous, it is essential that any
>> +        third-party integration see consistent data even if the
>> +        running guest continues to modify that data after the point in
>> +        time of the capture.</dd>
> 
> This is mostly related to the backup operation but the paragraph does
> not mention it.

It's also possible with external snapshots (a third party can do 
whatever with the read-only backing file after the snapshot, and prior 
to any commit operation). I view the document as giving two lists: 
first, a list of tradeoffs; and second, a list of existing (including 
new) APIs and where they fit on the continuum of tradeoffs.  So the 
first half does not have to call out which APIs provide which benefits.

> 
>> +      <dt>Full vs. incremental</dt>
>> +      <dd>When periodically repeating the action of state capture, it
>> +        is useful to minimize the amount of state that must be
>> +        captured by exploiting the relation to a previous capture,
>> +        such as focusing only on the portions of the disk that the
>> +        guest has modified in the meantime.  Some approaches are able
>> +        to take advantage of checkpoints to provide an incremental
>> +        backup, while others are only capable of a full backup even if
>> +        that means re-capturing unchanged portions of the disk.</dd>
> 
> Also there is currently no viable incremental memory state capture.

True - that limitation is probably worth listing. (Then again, memory is 
so much smaller than disk sizes that incremental memory snapshots make 
less sense in the first place).

> 
>> +
>> +      <dt>Local vs. remote</dt>
>> +      <dd>Domains that completely use remote storage may only need
>> +        some mechanism to keep track of guest memory state while using
>> +        external means to manage storage. Still, hypervisor and guest
>> +        cooperation to ensure points in time when no I/O is in flight
>> +        across the network can be important for properly capturing
>> +        disk state.</dd>
>> +
>> +      <dt>Network latency</dt>
>> +      <dd>Whether it's domain storage or saving domain state into
>> +        remote storage, network latency has an impact on snapshot
>> +        data. Having dedicated network capacity, bandwidth, or quality
>> +        of service levels may play a role, as well as planning for how
>> +        much of the backup process needs to be local.</dd>
>> +    </dl>
>> +
>> +    <p>
>> +      An example of the various facets in action is migration of a
>> +      running guest. In order for the guest to be able to resume on
>> +      the destination at the same place it left off at the source, the
>> +      hypervisor has to get to a point where execution on the source
>> +      is stopped, the last remaining changes occurring since the
>> +      migration started are then transferred, and the guest is started
>> +      on the target. The management software thus must keep track of
>> +      the starting point and any changes since the starting
>> +      point. These last changes are often referred to as dirty page
>> +      tracking or dirty disk block bitmaps. At some point in time
>> +      during the migration, the management software must freeze the
>> +      source guest, transfer the dirty data, and then start the guest
>> +      on the target. This period of time must be minimal. To minimize
>> +      overall migration time, one is advised to use a dedicated
>> +      network connection with a high quality of service. Alternatively
>> +      saving the current state of the running guest can just be a
>> +      point in time type operation which doesn't require updating the
>> +      "last vestiges" of state prior to writing out the saved state
>> +      file. The state file is the point in time of whatever is current
>> +      and may contain incomplete data which if used to restart the
>> +      guest could cause confusion or problems because some operation
>> +      wasn't completed depending upon where in time the operation was
>> +      commenced.
> 
> The last sentence is not clear to me.

Indeed, and now I can't even figure out what was meant - that text came 
from John:
https://www.redhat.com/archives/libvir-list/2018-June/msg01614.html

Maybe just drop everything beginning with "Alternatively".

> 
>> +    </p>
>> +
>> +    <h2><a id="apis">State capture APIs</a></h2>
>> +    <p>With those definitions, the following libvirt APIs related to
>> +      state capture have these properties:</p>
>> +    <dl>
>> +      <dt>virDomainManagedSave</dt>
>> +      <dd>This API saves guest memory, with libvirt managing all of
>> +        the saved state, then stops the guest. While stopped, the
>> +        disks can be copied by a third party.  However, since any
>> +        subsequent restart of the guest by libvirt API will restore
>> +        the memory state (which typically only works if the disk state
>> +        is unchanged in the meantime), and since it is not possible to
>> +        get at the memory state that libvirt is managing, this is not
>> +        viable as a means for rolling back to earlier saved states,
>> +        but is rather more suited to situations such as suspending a
>> +        guest prior to rebooting the host in order to resume the guest
>> +        when the host is back up. This API also has a drawback of
>> +        potentially long guest downtime, and therefore does not lend
>> +        itself well to live backups.</dd>
> 
> Well, it's purpose is completely different. Should we really mention
> backups here at all?

Maybe reword that last phrase "and therefore does not lend itself well 
to a situation that requires live guest responsiveness".
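
For what it's worth, the use case I have in mind there is nothing more 
than (untested sketch):

#include <libvirt/libvirt.h>

/* Suspend a guest across a host reboot using libvirt-managed state. */
static void suspend_across_reboot(virDomainPtr dom)
{
    virDomainManagedSave(dom, 0);      /* memory saved, guest stopped */
    /* ...host reboots; disks can be copied while the guest is down... */
    if (virDomainHasManagedSaveImage(dom, 0) == 1)
        virDomainCreate(dom);          /* restores from the managed image */
}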

> 
>> +      <dt>virDomainSave</dt>
>> +      <dd>This API is similar to virDomainManagedSave(), but moves the
>> +        burden of managing the stored memory state to the user. As
>> +        such, the user can now couple saved state with copies of the
>> +        disks to perform a revert to an arbitrary earlier saved state.
>> +        However, changing who manages the memory state does not change
>> +        the drawback of potentially long guest downtime when capturing
>> +        state.</dd>
> 
> Again I'd not hint to the possibility that this can be abused to do
> crappy snapshots.

I'm trying to point out ALL the APIs that are related to saving state, 
even if they shouldn't be used as poor-man's snapshots.  Maybe:

The user can now combine saved state files with copies of disk contents 
made while the guest was offline in order to revert the guest back to 
that point in time, although it still suffers from potentially long 
guest downtime and is not very convenient as a form of state snapshots.
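
That is, something like this untested sketch (paths hypothetical):

#include <libvirt/libvirt.h>

/* User-managed save: pair the memory image with offline disk copies. */
static void save_with_disks(virConnectPtr conn, virDomainPtr dom)
{
    virDomainSave(dom, "/backup/guest.memstate");  /* stops the guest */
    /* ...copy the now-quiescent disk images alongside the state file;
     * reverting later means restoring both the disks and the memory... */
    virDomainRestore(conn, "/backup/guest.memstate");
}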

> 
>> +      <dt>virDomainSnapshotCreateXML()</dt>
>> +      <dd>This API wraps several approaches for capturing guest state,
>> +        with a general premise of creating a snapshot (where the
>> +        current guest resources are frozen in time and a new wrapper
>> +        layer is opened for tracking subsequent guest changes).  It
>> +        can operate on both offline and running guests, can choose
>> +        whether to capture the state of memory, disk, or both when
>> +        used on a running guest, and can choose between internal and
>> +        external storage for captured state.  However, it is geared
>> +        towards post-event captures (when capturing both memory and
>> +        disk state, the disk state is not captured until all memory
>> +        state has been collected first).  Using QEMU as the
>> +        hypervisor, internal snapshots currently have lengthy downtime
>> +        that is incompatible with freezing guest I/O, but external
> 
> This might be subject to change. Also capturing memory state with
> external snapshots has "downtime" implications or needlesly expands the
> memory image if --live is used.

Yes, the drawback of needless memory image expansion with --live should 
be mentioned (due to the migration stream being an append-only pipeline 
that allocates new bytes when the same memory is dirtied more than once 
during the migration phase, rather than a seekable stream that could 
overwrite existing bytes).

> 
>> +        snapshots are quick.  Since creating an external snapshot
>> +        changes which disk image resource is in use by the guest, this
>> +        API can be coupled with <code>virDomainBlockCommit()</code> to
>> +        restore things back to the guest using its original disk
> 
> Please don't mention this at all since the idea is to use libvirt APIs
> to delete external snapshots once implemented which will do the commit
> internally.

Why not? Can't we update the document at the time the libvirt APIs 
actually work?  If someone is trying to choose between libvirt APIs that 
work now, knowing a (less-than-ideal) way to revert is more useful than 
"patches are still needed, sorry we've been several years at the task".

> 
>> +        image, where a third-party tool can read the backing file
>> +        prior to the live commit.  See also
>> +        the <a href="formatsnapshot.html">XML details</a> used with
>> +        this command.</dd>
>> +
>> +      <dt>virDomainFSFreeze(), virDomainFSThaw()</dt>
>> +      <dd>This pair of APIs does not directly capture guest state, but
>> +        can be used to coordinate with a trusted live guest that state
>> +        capture is about to happen, and therefore guest I/O should be
>> +        quiesced so that the state capture is fully consistent, rather
>> +        than merely crash consistent.  Some APIs are able to
>> +        automatically perform a freeze and thaw via a flags parameter,
>> +        rather than having to make separate calls to these
>> +        functions. Also, note that freezing guest I/O is only possible
>> +        with trusted guests running a guest agent, and that some
>> +        guests place maximum time limits on how long I/O can be
>> +        frozen.</dd>
>> +
>> +      <dt>virDomainBlockCopy()</dt>
>> +      <dd>This API wraps approaches for capturing the disk state (but
>> +        not memory) of a running guest, but does not track
>> +        accompanying guest memory state, but can only operate on one
> 
> Memory state mentioned twice.

Will fix.

> 
>> +        block device per job.  To get a consistent copy of multiple
>> +        disks, multiple jobs must be run in parallel, then the domain
>> +        must be paused before ending all of the jobs.  The capture is
> 
> Again, this is a side effect of this API. The original idea is meant to
> allow moving storage to a different place while the VM is running.
> 
> Also the whole paragraph is heavily biased towards backup while the
> document is attempting to be neutral on the matter of state capture.

Hmm. Any suggestions for better wording then?
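
For reference, the dance the paragraph is trying to describe is roughly 
this (untested sketch; destination path hypothetical, and note that 
block copy has historically required a transient domain):

#include <libvirt/libvirt.h>

/* Point-in-time copy of one disk of a running guest; repeat per disk. */
static void copy_one_disk(virDomainPtr dom)
{
    const char *dest =
        "<disk type='file'>"
        "  <driver type='qcow2'/>"
        "  <source file='/backup/vda.copy.qcow2'/>"
        "</disk>";

    virDomainBlockCopy(dom, "vda", dest, NULL, 0, 0);
    /* ...wait for the block job to reach the ready state; with multiple
     * disks, pause the domain so all the mirrors line up, then... */
    virDomainBlockJobAbort(dom, "vda", 0);  /* 0 = keep guest on the
                                             * original; the copy is now
                                             * the backup */
}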

> 
>> +        consistent only at the end of the operation with a choice for
>> +        future guest changes to either pivot to the new file or to
>> +        resume to just using the original file.  The resulting backup
>> +        file is thus the other file no longer in use by the
>> +        guest.</dd>
>> +
>> +      <dt>virDomainCheckpointCreateXML()</dt>
>> +      <dd>This API does not actually capture guest state, rather it
>> +        makes it possible to track which portions of guest disks have
>> +        changed between a checkpoint and the current live execution of
>> +        the guest.  However, while it is possible to use this API to
>> +        create checkpoints in isolation, it is more typical to create
>> +        a checkpoint as a side-effect of starting a new incremental
>> +        backup with <code>virDomainBackupBegin()</code>, since a
>> +        second incremental backup is most useful when using the
>> +        checkpoint created during the first.  <!--See also
> 
> Well, it would not really be incremental if you don't create the
> checkpoint.

Maybe this is better:

This API does not actually capture guest state, but makes it possible to 
track which portions of guest disks have changed between the checkpoint 
and the current live execution of the guest. Using this API creates an 
isolated checkpoint, which can provide insights into storage access 
patterns the guest is making; but when it comes to saving guest state 
for later reverts, it is instead better to atomically create a 
checkpoint as a side-effect of capturing an incremental backup, using 
<code>virDomainBackupBegin()</code>.
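
i.e., with the signature proposed earlier in this series (still under 
review, so treat this only as a sketch; the backup XML is not shown and 
hypothetical):

/* Create checkpoint "ckpt1" atomically with starting a backup job;
 * backup_xml lists which disks to back up and via which mode. */
const char *ckpt_xml =
    "<domaincheckpoint>"
    "  <name>ckpt1</name>"
    "</domaincheckpoint>";
virDomainBackupBegin(dom, backup_xml, ckpt_xml, 0);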


> 
>> +        the <a href="formatcheckpoint.html">XML details</a> used with
>> +        this command.--></dd>
>> +
>> +      <dt>virDomainBackupBegin(), virDomainBackupEnd()</dt>
>> +      <dd>This API wraps approaches for capturing the state of disks
>> +        of a running guest, but does not track accompanying guest
>> +        memory state.  The capture is consistent to the start of the
>> +        operation, where the captured state is stored independently
>> +        from the disk image in use with the guest and where it can be
>> +        easily integrated with a third-party for capturing the disk
>> +        state.  Since the backup operation is stored externally from
>> +        the guest resources, there is no need to commit data back in
>> +        at the completion of the operation.  When coupled with
>> +        checkpoints, this can be used to capture incremental backups
>> +        instead of full.</dd>
>> +    </dl>
>> +
>> +    <h2><a id="examples">Examples</a></h2>
>> +    <p>The following two sequences both accomplish the task of
>> +      capturing the disk state of a running guest, then wrapping
>> +      things up so that the guest is still running with the same file
>> +      as its disk image as before the sequence of operations began.
>> +      The difference between the two sequences boils down to the
>> +      impact of an unexpected interruption made at any point in the
>> +      middle of the sequence: with such an interruption, the first
>> +      example leaves the guest tied to a temporary wrapper file rather
>> +      than the original disk, and requires manual clean up of the
>> +      domain definition; while the second example has no impact to the
>> +      domain definition.</p>
>> +
>> +    <p>1. Backup via temporary snapshot
>> +      <pre>
>> +virDomainFSFreeze()
> 
> There is a flag for these for the snapshot API
> 
>> +virDomainSnapshotCreateXML(VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
> 
> This creates snapshot metadata in libvirt ...
> 
>> +virDomainFSThaw()
>> +third-party copy the backing file to backup storage # most time spent here
>> +virDomainBlockCommit(VIR_DOMAIN_BLOCK_COMMIT_ACTIVE) per disk
> 
> ... which is invalidated here. (Yes the images stay in place but the
> snapshot is not consistent if not done properly.)

Ah right. Either I should mention the NO_METADATA hack (libvirt isn't 
even tracking the snapshot) or else mention the 
virDomainSnapshotDelete(METADATA_ONLY) for getting rid of the snapshot 
chain that is no longer useful.
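
Going the NO_METADATA route, the per-disk version of sequence 1 would 
look roughly like this (untested sketch; snapshot XML and disk target 
hypothetical, error handling omitted):

#include <libvirt/libvirt.h>

static void backup_via_snapshot(virDomainPtr dom, const char *snap_xml)
{
    virDomainSnapshotPtr snap;

    virDomainFSFreeze(dom, NULL, 0, 0);
    /* snap_xml names the external overlay file(s) for each disk */
    snap = virDomainSnapshotCreateXML(dom, snap_xml,
                                      VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
                                      VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA);
    virDomainFSThaw(dom, NULL, 0, 0);
    /* ...third party copies the now read-only backing file... */
    virDomainBlockCommit(dom, "vda", NULL, NULL, 0,
                         VIR_DOMAIN_BLOCK_COMMIT_ACTIVE);
    /* wait for the per-disk block job ready event, then pivot back: */
    virDomainBlockJobAbort(dom, "vda", VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT);
    virDomainSnapshotFree(snap);
}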

> 
>> +wait for commit ready event per disk
>> +virDomainBlockJobAbort() per disk
> 
> As noted above, please don't document this hack.

It's been documented elsewhere, and is at least a way to capture a 
backup image with existing libvirt, in spite of how hacky it is (which 
is part of why it has taken us so long to get to any better API - it's 
hard to justify new interfaces when an existing hack works).

> 
>> +      </pre></p>
>> +
>> +    <p>2. Direct backup
>> +      <pre>
>> +virDomainFSFreeze()
>> +virDomainBackupBegin()
>> +virDomainFSThaw()
>> +wait for push mode event, or pull data over NBD # most time spent here
>> +virDomainBackupEnd()
>> +    </pre></p>
>> +
>> +  </body>
>> +</html>
>> diff --git a/docs/formatsnapshot.html.in b/docs/formatsnapshot.html.in
>> index c60b4fb7c9..9ee355198f 100644
>> --- a/docs/formatsnapshot.html.in
>> +++ b/docs/formatsnapshot.html.in
>> @@ -9,6 +9,8 @@
>>       <h2><a id="SnapshotAttributes">Snapshot XML</a></h2>
>>
>>       <p>
>> +      Snapshots are one form
>> +      of <a href="domainstatecapture.html">domain state capture</a>.
>>         There are several types of snapshots:
>>       </p>
>>       <dl>
> 
> Other than the above it seems like a worthy high level documentation.
> Since it attempts to cover all kinds of state I'd prefer if it does not
> bias towards backups.

Hmm, I sort of see what you mean. I only gave two lists (tradeoffs, and 
then how existing APIs fare on those tradeoffs with a bias for using 
those APIs for backups), then an example (focused solely on backups). 
With more time, the document might be even better with even more 
separation (a list of tradeoffs, a list of existing APIs but without any 
bias towards backup, and then a more exhaustive list of examples showing 
not only backups, but also managed save, data analysis using 
checkpoints, and so on).  But how much of that has to be done before 
this series can go in, vs. as incremental followup patches that further 
improve the document?


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



