[libvirt] [PATCH v9 01/13] backup: Document new XML for checkpoints

Eric Blake eblake at redhat.com
Mon Jul 8 18:33:09 UTC 2019

On 7/8/19 6:23 AM, Peter Krempa wrote:
> On Sat, Jul 06, 2019 at 22:56:01 -0500, Eric Blake wrote:
>> Prepare for new checkpoint APIs by describing the XML that will
>> represent a checkpoint.  The checkpoint XML is modeled heavily after
>> virDomainSnapshotPtr. See the docs for more details.
>> Add testsuite coverage for some minimal uses of the XML (bare minimum,
>> the sample from html, and a full dumpxml). Although use of the
>> REDEFINE flag will require the <domain> subelement to be present, it
>> is easier for most of the tests to provide counterpart output produced
>> with the NO_DOMAIN flag (particularly since synthesizing a valid
>> <domain> during testing is not trivial).
>> Signed-off-by: Eric Blake <eblake at redhat.com>
>> ---
> [...]

>> +    <h2><a id="CheckpointAttributes">Checkpoint XML</a></h2>
>> +
>> +    <p>
>> +      One method of capturing domain disk backups is via the use of
>> +      incremental backups. Right now, incremental backups are only
>> +      supported for the QEMU hypervisor when using qcow2 disks at the
>> +      active layer; if other disk formats are in use, capturing disk
>> +      backups requires different libvirt APIs.
> Perhaps add a 'since' to note that all of this was added starting from
> 5.6.


>> +    </p>
>> +    <p>
>> +      Libvirt is able to facilitate incremental backups by tracking
>> +      disk checkpoints, which are points in time against which it is
>> +      easy to compute which portion of the disk has changed. Given a
>> +      full backup (a backup created from the creation of the disk to a
>> +      given point in time), coupled with the creation of a disk
>> +      checkpoint at that time, and an incremental backup (a backup
>> +      created from just the dirty portion of the disk between the
>> +      first checkpoint and the second backup operation), it is
>> +      possible to do an offline reconstruction of the state of the
>> +      disk at the time of the second backup without having to copy as
>> +      much data as a second full backup would require. Future API
>> +      additions will make it possible to create checkpoints in
>> +      conjunction with a backup
>> +      via <code>virDomainBackupBegin()</code> or with an external
>> +      snapshot via <code>virDomainSnapshotCreateXML2</code>; but for
>> +      now, libvirt exposes enough support to create disk checkpoints
>> +      independently from a backup operation
>> +      via <code>virDomainCheckpointCreateXML()</code>.
> So it will not happen in this release? Otherwise I don't see a point in
> documenting it.

virDomainBackupBegin still has a fighting chance of making it into 5.6;
virDomainSnapshotCreateXML2 is further down the road.  My most recent
post of 'backup-v9' for backup additions tweak this paragraph yet again
when virDomainBackupBegin is added:

>> +    <p>
>> +      A checkpoint is marked current if the hypervisor is actively
>> +      updating the checkpoint as guest operations modify the disk
>> +      after the point where the checkpoint was created. When tracking
>> +      multiple checkpoints, a hypervisor may choose to either have all
>> +      relevant checkpoints be current (a guest write updates thea
> So to reiterate. By definition all checkpoint should be "current",
> because they must be actively updated to be able to track the difference
> from the point when they were created until now.
> The fact whether it includes update to a bitmap and which bitmaps are
> updated does not seem to be relevant for outside users.
> In fact we may even opt to change how this is done or perhaps convert
> one layout of bitmaps into a different one if we insist.
> Thus I'm saying it again. "current" property seems to be wrong in
> conjunction with checkpoints.

Updating this thread based on an IRC conversation (I'm not pasting the
entire conversation, but hitting the key points, so Peter may still want
to chime in):

There's two ways to do differential backups in qemu: either make all
checkpoints use an active bitmap (no parent relationship is needed, all
checkpoints would appear as roots), or change which bitmap is active on
a per-disk basis when a newer checkpoint touches that disk (checkpoints
need a parent relationship in order to reconstruct which bitmaps to
merge together when performing a differential backup).  But where things
get interesting is if a user performs a checkpoint/incremental backup of
only a subset of the domain's disks.

So, consider a domain with three disks, A, B, and C, and where we have
two users interested in different incremental backup goals: user 1 only
cares about backups of disk A and B, user 2 only cares about backups of
disk B and C.  Over time, we have user 1 create chk1.1, user 2 creates
chk2.1, user 1 creates chk1.2, user 2 creates chk2.2.

chk1.1 => A.chk1.1, B.chk1.1
chk2.1 =>           B.chk2.1, C.chk2.1
chk1.2 => A.chk1.2, B.chk1.2
chk2.2 =>           B.chk2.2, C.chk2.2

then what happens when user 1 wants to do what they think is an
incremental backup starting from chk1.2 or a differential backup
starting from chk1.1.  If we always leave bitmaps active, then we don't
need to track any parent relationships; the incremental or differential
backup is trivial: use the single bitmap recorded in the checkpoint
being used as the starting point.  However, disk B has a lot of overhead
with every guest write potentially updating 4 bitmaps.

Or, we can take the approach that Checkpoints are always created with
the parent object being the most-recent previously-created checkpoint
(this has to be tracked somewhere, but not necessarily exposed through
the XML nor through specialized API - even though the specialized API
was how the Snapshot was implemented), AND with the notion that every
disk has at most one active bitmap, but the active bitmap might not be
in the most-recent checkpoint.  When a checkpoint is created, it's
parent is assigned to be the most-recent existing checkpoint, and for
every disk in the new checkpoint, any earlier checkpoint (possibly
further back than the parent) touching the same disk has that earlier
bitmap disabled.

Or using the above figure for reference, the act of creating chk1.2
touches only disks A and B, but sets chk2.1 as the immediate parent
(even though it covers a different set of disks).  Furthermore, in
addition to creating new bitmaps for disk A and B, it disables A.chk1.1
(found in the grandparent checkpoint) and B.chk2.1 (in the parent, since
B.chk1.1 is already disabled).  Then, after chk2.2 is created, and the
user requests chk1.2 as the point of reference for a backup operation,
the user is requesting an incremental backup of disk A (can use bitmap
A.chk1.2 in isolation)  at the same time as a differential backup of
disk B (the merge of bitmaps B.chk.1.2 and B.chk2.2).  Thus, both the
act of creating a checkpoint and the act of starting a backup operation
have to concern themselves with performing a per-disk search along the
checkpoint chain from the current checkpoint back through ancestors
(creation chases until the chain ends, backup chases until the
point-of-reference checkpoint).  What's more, a checkpoint being current
is orthogonal to whether a bitmap is active, so we really never do have
more than one current checkpoint, and the documentation attempt here is
wrong (current is not about which bitmaps are active, but about which
checkpoint becomes the parent when a new checkpoint is added).

However, one observation we made on IRC was the following: when
snapshots were first created, we needed a way to recreate the libvirt
metadata, and the list APIs listed snapshots in an arbitrary order. That
meant that when using the REDEFINE flag, you needed to pass in a second
flag CURRENT to mark which snapshot being redefined was to become the
domain's current snapshot (and thus the implicit parent of any future
snapshot) - and as a result of that API, a side-effect was that you
could arbitrarily change the current snapshot ('virsh snapshot-current'
is rather gross, but shows this abuse).  Also, snapshots have a revert
operation, which changes the current snapshot and makes it very easy to
have a current snapshot which is not a leaf node.  On the other hand,
with checkpoints and our current interlock against mixing with snapshot
revert, the current node is always a leaf node, and if you perform your
listing of all checkpoints with the recently introduced TOPOLOGICAL
flag, then it is just as easy to state that when using REDEFINE in the
same order, the last checkpoint thus redefined becomes the current node.

It may still be possible to create multiple child checkpoints to a
single parent once we work out how to enable revert-to-snapshots
combined with checkpoints. And when that happens, we may need to add a
CHECKPOINT_CURRENT flag for use with REDEFINE at that time, but it does
not have to be today.  In other words, for as long as our checkpoint
chain is linear and the current node is the (one and only) leaf, we
don't need any API for managing a current checkpoint.

The IRC conversation also pointed out that before I push things, I _do_
also need to interlock such that the existence of checkpoints prevents
blockpull/blockcommit, in case libvirt is started with an external
backing chain of bitmaps created outside of libvirt, even though libvirt
itself will refuse to create such a chain.

>> +      associated tracking for each checkpoint), or to have a single
>> +      checkpoint be current (older checkpoints stop tracking changes
> NACK, internal detail. Don't document it. Also I'll object to this later
> in detail.
>> +      when a newer one is created, and describing all changes since an
>> +      older checkpoint then involves libvirt automatically combining
>> +      the changes from the chain of checkpoints from the current back
>> +      to the checkpoint in question; the QEMU hypervisor uses this
>> +      method). Checkpoints are maintained in a hierarchy to facilitate
> Nope. All of the above seem to be internal stuff.

So for my next spin of this patch, my goal is to eliminate as much
public API/XML related to a current checkpoint as possible, and see if I
can implement it completely internally by using the one-and-only leaf
node as the current checkpoint.

>> +      </dd>
>> +      <dt><code>parent</code></dt>
>> +      <dd>An optional readonly representation of the parent of this
>> +        checkpoint. If present, this element contains exactly one
>> +        child element, <code>name</code>.
> This reads weird and does not seem to say much.

Copying from snapshots.  I kept the same XML for consistency:


even though if I were designing it from scratch, I would prefer:


But maybe I can tweak the wording in that paragraph (and its counterpart
in the snapshot docs) to be easier to understand.

>> +      </dd>
>> +      <dt><code>current</code></dt>
>> +      <dd>A readonly integer, 1 if this checkpoint is current (that
>> +        is, actively tracking guest changes) or 0 if the checkpoint is
>> +        inactive because a child checkpoint is now tracking changes.
>> +      </dd>
> See below in the schema. The contents of this element are wrong.

If I can get by without any mention of current in the public XML
(perhaps by relying on the rule that as long as we have interlocks with
snapshots, we can't branch, and therefore there is at most one leaf
node, and simply define that the leaf node IS the current node), then
this goes away.

>> +++ b/docs/formatsnapshot.html.in
>> @@ -91,7 +91,9 @@
>>        sets that snapshot as current, and the prior current snapshot is
>>        the parent of the new snapshot.  Branches in the hierarchy can
>>        be formed by reverting to a snapshot with a child, then creating
>> -      another snapshot.
>> +      another snapshot.  For now, the creation of external snapshots
>> +      when checkpoints exist is forbidden, although future work will
>> +      make it possible to integrate these two concepts.
>>      </p>
>>      <p>
>>        The top-level <code>domainsnapshot</code> element may contain
> I'd slightly prefer if the snapshot-interlock stuff is in a separate
> patch.

Okay, I can split.

>> +  <define name='domaincheckpoint'>
>> +    <element name='domaincheckpoint'>
>> +      <interleave>
>> +        <optional>
>> +          <element name='name'>
>> +            <text/>
> This should be something more sensible than '<text/>' we have plenty of
> types for names which exclude newlines and '/'.

Copying from <domainsnapshot> but yes, that's not a good excuse, and
it's always easier to relax later than it is to restrict later.  And
since we DO store the checkpoint name as a file (at least for qemu),
restricting the RNG up front makes it easier to prevent the use of '../'
as a name, since we're strictly validating RNG compliance.

>> +          </element>
>> +        </optional>
>> +        <optional>
>> +          <element name='description'>
>> +            <text/>
>> +          </element>
>> +        </optional>
>> +        <optional>
>> +          <element name='creationTime'>
>> +            <text/>
> This is supposed to be a number.

Copying from <domainsnapshot>, but yes this can be improved.

>> +          </element>
>> +        </optional>
>> +        <optional>
>> +          <element name='current'>
>> +            <choice>
>> +              <value>0</value>
>> +              <value>1</value>
> This looks weird. I see 3 options to make it better:
> 1) put it as an attribute of 'domaincheckpoint' (and use 'yes'/'no' as
> values)
> 2) encode it as presence of the element
> 3) at least use 'yes'/'no' as values.

or 4) don't have it, after all (at least, not until we work out whether
integration with snapshot revert will require it).

> ACK if you sanitize the schema and avoid documenting internals in the
> public docs. I think that also the bitmap name field should be
> documented as internal.

Makes sense - the field will still be in the RNG schema (for two
reasons: REDEFINE must document everything it will parse, and when not
using REDEFINE, it's still often easier to copy-and-paste an existing
checkpoint XML to form a new checkpoint XML, without having to worry
about scrubbing out the readonly fields that will be ignored in the new
checkpoint), but I can document it as ignored during creation and
read-only in output for which bitmap got created.

Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/libvir-list/attachments/20190708/caf85b89/attachment-0001.sig>

More information about the libvir-list mailing list