[libvirt] [PATCH 2/8] backup: Document nuances between different state capture APIs

John Ferlan jferlan at redhat.com
Mon Jun 25 13:57:06 UTC 2018



On 06/13/2018 12:42 PM, Eric Blake wrote:
> Upcoming patches will add support for incremental backups via
> a new API; but first, we need a landing page that gives an
> overview of capturing various pieces of guest state, and which
> APIs are best suited to which tasks.
> 
> Signed-off-by: Eric Blake <eblake at redhat.com>
> ---
>  docs/docs.html.in               |   5 ++
>  docs/domainstatecapture.html.in | 190 ++++++++++++++++++++++++++++++++++++++++
>  docs/formatsnapshot.html.in     |   2 +
>  3 files changed, 197 insertions(+)
>  create mode 100644 docs/domainstatecapture.html.in
> 

This got a lot messier than originally intended. As noted in my response
for .1, I haven't really followed the discussions thus far, so take this
review from that viewpoint: someone from outside the current discussion
trying to make sense of what this topic is all about.

> diff --git a/docs/docs.html.in b/docs/docs.html.in
> index 40e0e3b82e..4c46b74980 100644
> --- a/docs/docs.html.in
> +++ b/docs/docs.html.in
> @@ -120,6 +120,11 @@
> 
>          <dt><a href="secureusage.html">Secure usage</a></dt>
>          <dd>Secure usage of the libvirt APIs</dd>
> +
> +        <dt><a href="domainstatecapture.html">Domain state
> +            capture</a></dt>
> +        <dd>Comparison between different methods of capturing domain
> +          state</dd>
>        </dl>
>      </div>
> 
> diff --git a/docs/domainstatecapture.html.in b/docs/domainstatecapture.html.in
> new file mode 100644
> index 0000000000..00ab7e8ee1
> --- /dev/null
> +++ b/docs/domainstatecapture.html.in
> @@ -0,0 +1,190 @@
> +<?xml version="1.0" encoding="UTF-8"?>
> +<!DOCTYPE html>
> +<html xmlns="http://www.w3.org/1999/xhtml">
> +  <body>
> +
> +    <h1>Domain state capture using Libvirt</h1>
> +
> +    <ul id="toc"></ul>
> +
> +    <p>
> +      This page compares the different means for capturing state
> +      related to a domain managed by libvirt, in order to aid
> +      application developers to choose which operations best suit
> +      their needs.

I would alter the sentence at the comma... IOW:

In order to aid ... their needs, this page compares ... by libvirt.

Then rather than discussing this below - I think we really need to state
right at the top the following:

</p>
<p>
The information here is primarily geared towards capturing the state of
an active domain. Capturing the state of an inactive domain essentially
only requires the contents of the guest disks; restoring that state is
merely a fresh boot with the disks returned to that content. Some of the
subsequent functionality covers inactive state collection, but it's not
the primary focus.

> +    </p>
> +
> +    <h2><a id="definitions">State capture trade-offs</a></h2>
> +
> +    <p>One of the features made possible with virtual machines is live
> +      migration, or transferring all state related to the guest from
> +      one host to another, with minimal interruption to the guest's

to me the commas are unnecessary.

> +      activity.  A clever observer will then note that if all state is

s/activity./activity. In this case, state includes domain memory
(including the current instruction stream) and domain storage, whether
that is local virtual disks that are not present on a target host or
networked storage being updated by the local hypervisor. A clever...

[BTW: In rereading my response, I almost want to add: "As it relates
to domain checkpoints and backups, state only includes disk state
changes."  However, I'm not sure if that ties in yet or not. I think it
only matters for the two new APIs being discussed.]

> +      available for live migration, there is nothing stopping a user

, then there is...

> +      from saving that state at a given point of time, to be able to

s/,/ in order/

> +      later rewind guest execution back to the state it previously
> +      had.  There are several different libvirt APIs associated with

[BTW: The following includes something else I pulled up from the list
below...]

s/had. /had. The astute reader will also realize that state capture at
any level requires that the data be stored and managed by some
mechanism. That storage may be a single file or some set of chained
files. This is the inflection point where Libvirt would (could, should?)
integrate with third-party tools that are built around managing the
volume of data possibly generated by multiple domains with multiple
disks. That leaves Libvirt with the task of synchronizing the capture
algorithms so they can work seamlessly with the underlying hypervisor.

<paragraph break>

There are several libvirt APIs associated with ...

(different is superfluous)

> +      capturing the state of a guest, such that the captured state can

s/, such that the captured state/which

> +      later be used to rewind that guest to the conditions it was in
> +      earlier.  But since there are multiple APIs, it is best to
> +      understand the tradeoffs and differences between them, in order
> +      to choose the best API for a given task.

s/But since ... given task./The following is a list of trade-offs and
differences between the various facets that affect capturing domain
state for active domains:/

> +    </p>
> +
> +    <dl>
> +      <dt>Timing</dt>

"Data Completeness"  (or Integrity)

> +      <dd>Capturing state can be a lengthy process, so while the
> +        captured state ideally represents an atomic point in time
> +        correpsonding to something the guest was actually executing,

corresponding

> +        some interfaces require up-front preparation (the state

s/preparation (the.../preparation. The ...

> +        captured is not complete until the API ends, which may be some
> +        time after the command was first started), while other

s/started), while .../started. While .../

> +        interfaces track the state when the command was first issued
> +        even if it takes some time to finish capturing the state.

Feels like a paragraph break... Or even a whole new bullet:

   <dt>Quiescing of Data</dt>

> +        While it is possible to freeze guest I/O around either point
> +        in time (so that the captured state is fully consistent,

s/around either point in time/at any point in time/

> +        rather than just crash-consistent), knowing whether the state
> +        is captured at the start or end of the command may determine
> +        which approach to use.  A related concept is the amount of
> +        downtime the guest will experience during the capture,
> +        particularly since freezing guest I/O has time
> +        constraints.</dd>

That last sentence:

Freezing guest I/O can be problematic depending on what the guest's
expectations are and the duration of the freeze. Some software will
rightfully panic once it is given the chance to realize it lost some
number of seconds. In general, though, long pauses are unacceptable, so
reducing the time spent frozen is a goal of management software. Still,
balancing data integrity and completeness does require some amount of
time.
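
If it helps readers, a short sketch of bounding that freeze window might
be worth adding - something along these lines (domain handling, error
checking and names are illustrative, and it assumes the qemu guest agent
is running in the guest):

/* Sketch: keep the quiesce window as short as possible around a capture.
 * Assumes the qemu guest agent is available; names are illustrative. */
#include <libvirt/libvirt.h>

static int quiesced_capture(virDomainPtr dom)
{
    /* Freeze all guest filesystems (flushes and blocks guest I/O). */
    if (virDomainFSFreeze(dom, NULL, 0, 0) < 0)
        return -1;

    /*
     * ... establish the point in time here (snapshot, backup, block copy);
     * only this step needs to happen while frozen, not the lengthy data
     * transfer that follows ...
     */

    /* Thaw as soon as the point in time is established. */
    return virDomainFSThaw(dom, NULL, 0, 0);
}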

> +
> +      <dt>Amount of state</dt>

I pulled part of this to earlier - let's face it, the offline guest is
not the focus of this work, so I think it's only worth discussing or
noting that an API can handle active or inactive guests.

> +      <dd>For an offline guest, only the contents of the guest disks
> +        needs to be captured; restoring that state is merely a fresh
> +        boot with the disks restored to that state.  But for an online

<dt>Memory State only, Disk change only, or Both</dt>

> +        guest, there is a choice between storing the guest's memory
> +        (all that is needed during live migration where the storage is
> +        shared between source and destination), the guest's disk state
> +        (all that is needed if there are no pending guest I/O
> +        transactions that would be lost without the corresponding
> +        memory state), or both together.  Unless guest I/O is quiesced
> +        prior to capturing state, then reverting to captured disk
> +        state of a live guest without the corresponding memory state
> +        is comparable to booting a machine that previously lost power
> +        without a clean shutdown; but for a guest that uses
> +        appropriate journaling methods, this crash-consistent state
> +        may be sufficient to avoid the additional storage and time
> +        needed to capture memory state.</dd>
> +
> +      <dt>Quantity of files</dt>

I ended up essentially moving this concept prior to the list of facets.
I suppose it could go here too, but I'm not sure this ends up being so
much a tradeoff as an overall design decision made when deciding to
implement some sort of backup solution. In the long run, the management
software "doesn't care" where or how the data is stored - it's just
providing the data.

> +      <dd>When capturing state, some approaches store all state within
> +        the same file (internal), while others expand a chain of
> +        related files that must be used together (external), for more
> +        files that a management application must track.  There are
> +        also differences depending on whether the state is captured in
> +        the same file in use by a running guest, or whether the state
> +        is captured to a distinct file without impacting the files
> +        used to run the guest.</dd>

That last sentence could almost be its own bullet "Impact to Active
State". Libvirt already captures active state as part of normal
processing by updates to the domain's active XML.

> +
> +      <dt>Third-party integration</dt>

So again, this isn't a facet affecting capturing state - it seems to be
more of a statement related to the reality of what Libvirt can/should be
expected to do vs. being able to allow configurability for external
forces to make certain decisions.

> +      <dd>When capturing state, particularly for a running, there are
> +        tradeoffs to how much of the process must be done directly by
> +        the hypervisor, and how much can be off-loaded to third-party
> +        software.  Since capturing state is not instantaneous, it is
> +        essential that any third-party integration see consistent data
> +        even if the running guest continues to modify that data after
> +        the point in time of the capture.</dd>
> +
> +      <dt>Full vs. partial</dt>

How about Full vs. Incremental

> +      <dd>When capturing state, it is useful to minimize the amount of
> +        state that must be captured in relation to a previous capture,
> +        by focusing only on the portions of the disk that the guest
> +        has modified since the previous capture.  Some approaches are
> +        able to take advantage of checkpoints to provide an
> +        incremental backup, while others are only capable of a full
> +        backup including portions of the disk that have not changed
> +        since the previous state capture.</dd>

Is there a "downside" to the time needed for ma[r]king the "first
checkpoint"? Or do we dictate/assume that someone has some sort of
backup already before starting the domain and updates thereafter are all
incremental.

FWIW: A couple of others that come to mind that are facets:

<dt>Local or Remote Storage</dt>
Domains that completely use remote storage may only need some mechanism
to keep track of the guest memory state while using some external means
to manage/track the remote storage. Still, even with that, it's possible
that the hypervisor has I/Os "in flight" to the network storage that
could be "important data" in the big picture. So having the management
software keep track of all disk state can be important.

<dt>Network Latency</dt>
Whether it's domain storage or the saving of the domain data into some
remote storage, network latency has an impact on snapshot data. Having
dedicated network capacity/bandwidth and/or properly set quality of
service certainly helps.

> +    </dl>
> +

Perhaps in some way providing an example of a migration vs. pure save
could be helpful - using of course the various facets. Something like:

An example of the various facets in action is migration of a running
guest. In order for the guest to be able to start on a target from
whence it left off on a source, the guest has to get to a point where
execution on the source is stopped, the last remaining changes occurring
since the migration started are then transferred, and the guest is
started on the target. The management software thus must keep track of
the starting point and any changes since the starting point. These last
changes are often referred to as dirty page tracking or dirty disk block
bitmaps. At some point in time during the migration, the management
software must freeze the source guest, transfer the dirty data, and then
start the guest on the target. This period of time must be minimal. To
minimize overall migration time, one is advised to use a dedicated
network connection with a high quality of service. Alternatively,
saving the current state of the running guest can just be a
point-in-time operation which doesn't require updating the "last
vestiges" of state prior to writing out the saved state file. The state
file captures whatever is current at that point in time and may contain
incomplete data; if that data is used to restart the guest, it could
cause confusion or problems because some operation was still in flight
when the capture was taken.
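
If we go that route, a tiny sketch contrasting the two calls might also
help ground the prose - something like (URIs, names and paths made up,
error handling trimmed):

/* Sketch: live migration versus a point-in-time save of a running guest.
 * Connection URIs, domain name and save path are illustrative. */
#include <libvirt/libvirt.h>

static int migrate_live(virDomainPtr dom)
{
    /* Dirty-page tracking, the brief source pause and the handoff to the
     * target are all driven by the hypervisor under libvirt's control. */
    return virDomainMigrateToURI(dom, "qemu+ssh://target-host/system",
                                 VIR_MIGRATE_LIVE | VIR_MIGRATE_PEER2PEER,
                                 NULL, 0);
}

static int save_point_in_time(virDomainPtr dom)
{
    /* Stops the guest and writes its memory state to a file; the guest can
     * later be resumed from that file with virDomainRestore(). */
    return virDomainSave(dom, "/var/lib/libvirt/save/demo-guest.save");
}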


> +    <h2><a id="apis">State capture APIs</a></h2>
> +    <p>With those definitions, the following libvirt APIs have these
> +      properties:</p>

Do you think perhaps it would be a good idea to list the pros and cons
of each of the APIs?  As in, why someone would want to use one over
another, and of course which ones work together...

> +    <dl>
> +      <dt>virDomainSnapshotCreateXML()</dt>
> +      <dd>This API wraps several approaches for capturing guest state,
> +        with a general premise of creating a snapshot (where the
> +        current guest resources are frozen in time and a new wrapper
> +        layer is opened for tracking subsequent guest changes).  It
> +        can operate on both offline and running guests, can choose
> +        whether to capture the state of memory, disk, or both when
> +        used on a running guest, and can choose between internal and
> +        external storage for captured state.  However, it is geared
> +        towards post-event captures (when capturing both memory and
> +        disk state, the disk state is not captured until all memory
> +        state has been collected first).  For qemu as the hypervisor,

s/For qemu/Using QEMU/

> +        internal snapshots currently have lengthy downtime that is
> +        incompatible with freezing guest I/O, but external snapshots
> +        are quick.  Since creating an external snapshot changes which
> +        disk image resource is in use by the guest, this API can be
> +        coupled with <code>virDomainBlockCommit()</code> to restore
> +        things back to the guest using its original disk image, where
> +        a third-party tool can read the backing file prior to the live
> +        commit.  See also the <a href="formatsnapshot.html">XML
> +        details</a> used with this command.</dd>

Some random grumbling from me about the complexity of SnapshotCreateXML
interacting with BlockCommit ;-)...   Still, perhaps the Block APIs
should be listed first to create a grounding of what they do before they
are referenced in this Snapshot section? Of course, all that without
getting into the gnarly details of using the Block APIs.

The SnapshotCreateXML API is a "complex maze" of flag usage where it's
important to understand the nuances between active/inactive,
internal/external, memory/disk/both, and reversion to the point in time.
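
If the description stays, a short sketch of the external disk-only
snapshot plus active-commit dance could take some of the mystery out of
it - roughly (domain, disk target and overlay path all made up; error
handling and the wait for the job-ready event omitted):

/* Sketch: external, disk-only snapshot of a running guest, cleaned up later
 * with an active block commit.  Names and paths are illustrative. */
#include <libvirt/libvirt.h>

static const char *snap_xml =
    "<domainsnapshot>"
    "  <name>backup-tmp</name>"
    "  <disks>"
    "    <disk name='vda' snapshot='external'>"
    "      <source file='/var/lib/libvirt/images/demo.backup-tmp.qcow2'/>"
    "    </disk>"
    "  </disks>"
    "</domainsnapshot>";

static int snapshot_backup_commit(virDomainPtr dom)
{
    virDomainSnapshotPtr snap;

    /* The guest continues running, now writing into the new overlay, while
     * the original image becomes a stable backing file. */
    snap = virDomainSnapshotCreateXML(dom, snap_xml,
                                      VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
                                      VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA);
    if (!snap)
        return -1;
    virDomainSnapshotFree(snap);

    /* ... a third-party tool copies the now-quiescent backing file ... */

    /* Merge the overlay back into the original image. */
    if (virDomainBlockCommit(dom, "vda", NULL, NULL, 0,
                             VIR_DOMAIN_BLOCK_COMMIT_ACTIVE) < 0)
        return -1;

    /* In real code: wait for the block job to signal it is ready first. */
    return virDomainBlockJobAbort(dom, "vda",
                                  VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT);
}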

> +      <dt>virDomainBlockCopy()</dt>
> +      <dd>This API wraps approaches for capturing the state of disks
> +        of a running guest, but does not track accompanying guest
> +        memory state, and can only operate on one block device per job

s/, and/ and/

Realistically, the part about tracking guest memory state is probably
unnecessary.

> +        (to get a consistent copy of multiple disks, the domain must
> +        be paused before ending the multiple jobs).  The capture is

s/job (to .../job. To .../
s/jobs)./jobs./

> +        consistent only at the end of the operation, with a choice to

s/, with/ with/

> +        either pivot to the new file that contains the copy (leaving
> +        the old file as the backup), or to return to the original file

s/), or/) or/

> +        (leaving the new file as the backup).</dd>

s/(//
s/)//
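
A small sketch here might also help show the pivot-or-abort choice at
the end - something like (disk target and destination XML made up; error
handling and the wait for the job-ready event omitted):

/* Sketch: copy one disk of a running guest to a new file, then decide at the
 * end whether to pivot onto the copy or stay on the original.  Names and
 * paths are illustrative. */
#include <libvirt/libvirt.h>

static const char *dest_xml =
    "<disk type='file'>"
    "  <source file='/var/lib/libvirt/images/demo.copy.qcow2'/>"
    "  <driver name='qemu' type='qcow2'/>"
    "</disk>";

static int copy_one_disk(virDomainPtr dom)
{
    /* The job mirrors ongoing guest writes, so the copy is only consistent
     * once the job reports it is ready. */
    if (virDomainBlockCopy(dom, "vda", dest_xml, NULL, 0, 0) < 0)
        return -1;

    /* In real code: wait for the job-ready event, then pick one of: */

    /* keep running on the original and treat the copy as the backup ... */
    return virDomainBlockJobAbort(dom, "vda", 0);

    /* ... or pivot onto the copy and keep the original as the backup:
     * return virDomainBlockJobAbort(dom, "vda",
     *                               VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT);
     */
}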

The next two aren't even introduced yet...  But I'd probably also want
to know about virDomain{Save|ManagedSave[Image]}, which, while not
snapshot-level APIs, can be used to save domain state information. And
since we discussed quiesce and freeze/thaw above, should
virDomainFS{Freeze|Thaw} be discussed - if only to note their usage?
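
If they do get mentioned, even a tiny sketch of the managed-save cycle
would probably do (flags and domain handling illustrative):

/* Sketch: managed save of full memory + device state.  libvirt stores the
 * image itself and the next start of the domain resumes from it. */
#include <libvirt/libvirt.h>

static int managed_save_cycle(virDomainPtr dom)
{
    /* Guest stops; its state is kept in an image managed by libvirt. */
    if (virDomainManagedSave(dom, 0) < 0)
        return -1;

    /* Starting the domain again resumes from the managed save image. */
    return virDomainCreate(dom);
}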

From my perspective, the next two should be added once the two APIs are
introduced. Still, for grounding, this was a good introduction. I think
the order should be changed, since before an incremental backup is
possible there must be some way to set when time begins.

> +      <dt>virDomainBackupBegin()</dt>
> +      <dd>This API wraps approaches for capturing the state of disks
> +        of a running guest, but does not track accompanying guest
> +        memory state.  The capture is consistent to the start of the
> +        operation, where the captured state is stored independently
> +        from the disk image in use with the guest, and where it can be

s/, and/ and/

> +        easily integrated with a third-party for capturing the disk
> +        state.  Since the backup operation is stored externally from
> +        the guest resources, there is no need to commit data back in
> +        at the completion of the operation.  When coupled with
> +        checkpoints, this can be used to capture incremental backups
> +        instead of full.</dd>
> +      <dt>virDomainCheckpointCreateXML()</dt>
> +      <dd>This API does not actually capture guest state, so much as
> +        make it possible to track which portions of guest disks have

s/, so much as make/, rather it makes/

> +        change between checkpoints or between a current checkpoint and

changed?

> +        the live execution of the guest.  When performing incremental
> +        backups, it is easier to create a new checkpoint at the same
> +        time as a new backup, so that the next incremental backup can
> +        refer to the incremental state since the checkpoint created
> +        during the current backup.  Guest state is then actually

The "When performing ... current backup" description is a bit confusing
to read for me. So do I create Checkpoint before after the Backup or
both? Hard to be simultaneous - one comes first.

> +        captured using <code>virDomainBackupBegin()</code>.  <!--See also
> +        the <a href="formatcheckpoint.html">XML details</a> used with
> +        this command.--></dd>

This reliance on one another gets really confusing... Using the term
"Guest state" for Checkpoint to mean something different than Snapshot
makes for difficult comprehension in the grand scheme of domain management.

Of slight concern is that there's nothing in the Checkpoint or Backup
API naming scheme that says this is for disk/block only. There's also
nothing that would cause me to think they are related without reading
their descriptions.
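
Maybe a tiny usage sketch would clear up the ordering question. Something
like the following - where the prototypes are only my guess at the shape
of the proposed calls (the names come from this series; the parameters
are pure guesswork) and the XML payloads are equally made up:

/* Hypothetical sketch only: the backup/checkpoint APIs are still under
 * review, so these prototypes and XML strings are placeholders, not the
 * real API. */
#include <libvirt/libvirt.h>

int virDomainBackupBegin(virDomainPtr dom, const char *backup_xml,
                         const char *checkpoint_xml, unsigned int flags);
int virDomainBackupEnd(virDomainPtr dom, int id, unsigned int flags);

static void backup_cycle(virDomainPtr dom)
{
    int id;

    /* Full backup; checkpoint "ck1" records the same point in time. */
    id = virDomainBackupBegin(dom, "<domainbackup/>",
                              "<domaincheckpoint>"
                              "<name>ck1</name>"
                              "</domaincheckpoint>", 0);
    /* ... push or pull the data, then ... */
    virDomainBackupEnd(dom, id, 0);

    /* Later: back up only what changed since "ck1", while creating "ck2"
     * as the reference point for the run after this one. */
    id = virDomainBackupBegin(dom,
                              "<domainbackup>"
                              "<incremental>ck1</incremental>"
                              "</domainbackup>",
                              "<domaincheckpoint>"
                              "<name>ck2</name>"
                              "</domaincheckpoint>", 0);
    /* ... push or pull the data, then ... */
    virDomainBackupEnd(dom, id, 0);
}

That is, each backup both consumes the previous checkpoint and creates
the next one - which is how I ended up reading the "at the same time"
wording. If that's the intent, spelling it out would help.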


> +    </dl>
> +
> +    <h2><a id="examples">Examples</a></h2>
> +    <p>The following two sequences both capture the disk state of a
> +      running guest, then complete with the guest running on its
> +      original disk image; but with a difference that an unexpected
> +      interruption during the first mode leaves a temporary wrapper
> +      file that must be accounted for, while interruption of the
> +      second mode has no impact to the guest.</p>
> +    <p>1. Backup via temporary snapshot
> +      <pre>
> +virDomainFSFreeze()
> +virDomainSnapshotCreateXML(VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)
> +virDomainFSThaw()
> +third-party copy the backing file to backup storage # most time spent here
> +virDomainBlockCommit(VIR_DOMAIN_BLOCK_COMMIT_ACTIVE) per disk
> +wait for commit ready event per disk
> +virDomainBlockJobAbort() per disk
> +      </pre></p>
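
For the "wait for commit ready event per disk" step, spelling out what
"ready" means might help; an event callback is the right tool, but even
a polling sketch gets the idea across (disk target made up):

/* Sketch: poll an active block-commit job until all data has been copied and
 * it is safe to pivot/abort.  Real code should watch the block-job event
 * instead of polling.  The disk target name is illustrative. */
#include <unistd.h>
#include <libvirt/libvirt.h>

static int wait_for_commit_ready(virDomainPtr dom, const char *disk)
{
    virDomainBlockJobInfo info;

    for (;;) {
        int rc = virDomainGetBlockJobInfo(dom, disk, &info, 0);

        if (rc < 0)
            return -1;                      /* lookup failed */
        if (rc == 0)
            return -1;                      /* no job running on this disk */
        if (info.end > 0 && info.cur == info.end)
            return 0;                       /* all data copied: ready */

        usleep(100 * 1000);                 /* poll every 100 ms */
    }
}

with wait_for_commit_ready(dom, "vda") slotting in before the
virDomainBlockJobAbort() step above.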
> +
> +    <p>2. Direct backup
> +      <pre>
> +virDomainFSFreeze()
> +virDomainBackupBegin()
> +virDomainFSThaw()
> +wait for push mode event, or pull data over NBD # most time spent here
> +virDomainBackeupEnd()
> +    </pre></p>
> +
> +  </body>
> +</html>

The examples certainly need more beef w/r/t description. Personally, I
like the fact that we're heavily documenting things before writing the
code rather than the opposite direction.

John


> diff --git a/docs/formatsnapshot.html.in b/docs/formatsnapshot.html.in
> index f2e51df5ab..d7051683a5 100644
> --- a/docs/formatsnapshot.html.in
> +++ b/docs/formatsnapshot.html.in
> @@ -9,6 +9,8 @@
>      <h2><a id="SnapshotAttributes">Snapshot XML</a></h2>
> 
>      <p>
> +      Snapshots are one form
> +      of <a href="domainstatecapture.html">domain state capture</a>.
>        There are several types of snapshots:
>      </p>
>      <dl>
> 



