[libvirt] Overview of libvirt incremental backup API, part 2 (incremental/differential pull mode)

Nir Soffer nsoffer at redhat.com
Tue Oct 9 13:29:36 UTC 2018


On Fri, Oct 5, 2018 at 7:58 AM Eric Blake <eblake at redhat.com> wrote:

> On 10/4/18 12:05 AM, Eric Blake wrote:
> > The following (long) email describes a portion of the work-flow of how
> > my proposed incremental backup APIs will work, along with the backend
> > QMP commands that each one executes.  I will reply to this thread with
> > further examples (the first example is long enough to be its own email).
> > This is an update to a thread last posted here:
> > https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html
> >
>
> > More to come in part 2.
> >
>
> - Second example: a sequence of incremental backups via pull model
>
> In the first example, we did not create a checkpoint at the time of the
> full pull. That means we have no way to track a delta of changes since
> that point in time.


Why do we want to support backup without creating a checkpoint?

If we don't have any real use case, I suggest always requiring a checkpoint.


> Let's repeat the full backup (reusing the same
> backup.xml from before), but this time, we'll add a new parameter, a
> second XML file for describing the checkpoint we want to create.
>
> Actually, it was easy enough to get virsh to write the XML for me
> (because it was very similar to existing code in virsh that creates XML
> for snapshot creation):
>
> $ $virsh checkpoint-create-as --print-xml $dom check1 testing \
>     --diskspec sdc --diskspec sdd | tee check1.xml
> <domaincheckpoint>
>    <name>check1</name>
>

We should use an id, not a name, even if the name is also unique, as in most
libvirt APIs.

In RHV we will always use a UUID for this.
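
For example, with a made-up UUID as the checkpoint name (same command as
above, only the name changes):

$ $virsh checkpoint-create-as --print-xml $dom \
    1fa2cd55-2b1d-4c16-b9a5-0c0e95b0c42e testing \
    --diskspec sdc --diskspec sdd | tee check1.xml
<domaincheckpoint>
   <name>1fa2cd55-2b1d-4c16-b9a5-0c0e95b0c42e</name>
   ...
</domaincheckpoint>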


>    <description>testing</description>
>    <disks>
>      <disk name='sdc'/>
>      <disk name='sdd'/>
>    </disks>
> </domaincheckpoint>
>
> I had to supply two --diskspec arguments to virsh to select just the two
> qcow2 disks that I am using in my example (rather than every disk in the
> domain, which is the default when <disks> is not present).


So is <disks /> a valid configuration that selects all disks, or does omitting
the <disks> element select all disks?
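
To make the question concrete: are these two forms equivalent, both selecting
every disk in the domain?

<domaincheckpoint>
   <name>check1</name>
   <disks/>
</domaincheckpoint>

<domaincheckpoint>
   <name>check1</name>
</domaincheckpoint>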


> I also picked
> a name (mandatory) and description (optional) to be associated with the
> checkpoint.
>
> The backup.xml file that we plan to reuse still mentions scratch1.img
> and scratch2.img as files needed for staging the pull request. However,
> any contents in those files could interfere with our second backup
> (after all, every cluster written into that file from the first backup
> represents a point in time that was frozen at the first backup; but our
> second backup will want to read the data as the guest sees it now rather
> than what it was at the first backup), so we MUST regenerate the scratch
> files. (Perhaps I should have just deleted them at the end of example 1
> in my previous email, had I remembered when typing that mail).
>
> $ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
> $ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
>
> Now, to begin the full backup and create a checkpoint at the same time.
> Also, this time around, it would be nice if the guest had a chance to
> freeze I/O to the disks prior to the point chosen as the checkpoint.
> Assuming the guest is trusted, and running the qemu guest agent (qga),
> we can do that with:
>
> $ $virsh fsfreeze $dom
> $ $virsh backup-begin $dom backup.xml check1.xml
> Backup id 1 started
> backup used description from 'backup.xml'
> checkpoint used description from 'check1.xml'
> $ $virsh fsthaw $dom
>

Great, this answers my (unsent) question about freeze/thaw from part 1 :-)

>
> and eventually, we may decide to add a VIR_DOMAIN_BACKUP_BEGIN_QUIESCE
> flag to combine those three steps into a single API (matching what we've
> done on some other existing API).  In other words, the sequence of QMP
> operations performed during virDomainBackupBegin are quick enough that
> they won't stall a freeze operation (at least Windows is picky if you
> stall a freeze operation longer than 10 seconds).
>

We use fsFreeze/fsThaw directly in RHV since we need to support external
snapshots (e.g. ceph), so we don't need this functionality, but it sounds like
a good idea to make it work the same way as snapshots.
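
If such a flag is added, I assume the freeze/backup/thaw sequence above would
collapse into a single call along the lines of (just a sketch; the flag is
only proposed at this point):

  virDomainBackupBegin(dom, "<domainbackup ...>", "<domaincheckpoint ...>",
                       VIR_DOMAIN_BACKUP_BEGIN_QUIESCE);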


>
> The tweaked $virsh backup-begin now results in a call to:
>   virDomainBackupBegin(dom, "<domainbackup ...>",
>     "<domaincheckpoint ...", 0)
> and in turn libvirt makes a similar sequence of QMP calls as before,
> with a slight modification in the middle:
> {"execute":"nbd-server-start",...
> {"execute":"blockdev-add",...
>

This does not work yet for network disks like "rbd" and "glusterfs". Does that
mean they will not be supported for backup?


> {"execute":"transaction",
>   "arguments":{"actions":[
>    {"type":"blockdev-backup", "data":{
>     "device":"$node1", "target":"backup-sdc", "sync":"none",
>     "job-id":"backup-sdc" }},
>    {"type":"blockdev-backup", "data":{
>     "device":"$node2", "target":"backup-sdd", "sync":"none",
>     "job-id":"backup-sdd" }},
>    {"type":"block-dirty-bitmap-add", "data":{
>     "node":"$node1", "name":"check1", "persistent":true}},
>    {"type":"block-dirty-bitmap-add", "data":{
>     "node":"$node2", "name":"check1", "persistent":true}}
>   ]}}
> {"execute":"nbd-server-add",...
>


What if this sequence fails in the middle? Will libvirt handle all failures
and roll back to the previous state?

What are the semantics of "execute": "transaction"? Does it mean that qemu
will handle all possible failures in any of the actions?

(Will continue later)


>
> The only change was adding more actions to the "transaction" command -
> in addition to kicking off the fleece image in the scratch nodes, it
> ALSO added a persistent bitmap to each of the original images, to track
> all changes made after the point of the transaction.  The bitmaps are
> persistent - at this point (well, it's better if you wait until after
> backup-end), you could shut the guest down and restart it, and libvirt
> will still remember that the checkpoint exists, and qemu will continue to
> track guest writes via the bitmap. However, the backup job itself is
> currently live-only, and shutting down the guest while a backup
> operation is in effect will lose track of the backup job.  What that
> really means is that if the guest shuts down, your current backup job is
> hosed (you cannot ever get back the point-in-time data from your API
> request - as your next API request will be a new point in time) - but
> you have not permanently ruined the guest, and your recovery is to just
> start a new backup.
>
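
As a side note, a quick way to verify that the persistent bitmap survived a
guest restart is to query the block layer directly (the same query-block call
libvirt uses below for the size estimate) and look for a "check1" entry among
the dirty bitmaps reported for $node1's device:

{"execute":"query-block"}
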
> Pulling the data out from the backup is unchanged from example 1; virsh
> backup-dumpxml will show details about the job (yes, the job id is still
> 1 for now), and when ready, virsh backup-end will end the job and
> gracefully take down the NBD server with no difference in QMP commands
> from before.  Thus, the creation of a checkpoint didn't change any of
> the fundamentals of capturing the current backup, but rather is in
> preparation for the next step.
>
> $ $virsh backup-end $dom 1
> Backup id 1 completed
> $ rm scratch1.img scratch2.img
>
> [We have not yet designed how qemu bitmaps will interact with external
> snapshots - but I see two likely scenarios:
>   1. Down the road, I add a virDomainSnapshotCheckpointCreateXML() API,
> which adds a checkpointXML parameter but otherwise behaves like the
> existing virDomainSnapshotCreateXML - if that API is added in a
> different release than my current API proposals, that's yet another
> libvirt.so rebase to pickup the new API.
>   2. My current proposal of virDomainBackupBegin(dom, "<domainbackup>",
> "<domaincheckpoint>", flags) could instead be tweaked to a single XML
> parameter, virDomainBackupBegin(dom, "
> <domainbackup>
>    <domaincheckpoint> ... </domaincheckpoint>
> </domainbackup>", flags) prior to adding my APIs to libvirt 4.9, then
> down the road, we also tweak <domainsnapshot> to take an optional
> <domaincheckpoint> sub-element, and thus reuse the existing
> virDomainSnapshotCreateXML() to now also create checkpoints without a
> further API addition.
> Speak up now if you have a preference between the two ideas]
>
> Now that we have concluded the full backup and created a checkpoint, we
> can do more things with the checkpoint (it is persistent, after all).
> For example:
>
> $ $virsh checkpoint-list $dom
>   Name                 Creation Time
> --------------------------------------------
>   check1               2018-10-04 15:02:24 -0500
>
> This called virDomainListCheckpoints(dom, &array, 0) under the hood to get a
> list of virDomainCheckpointPtr objects, then called
> virDomainCheckpointGetXMLDesc(array[0], 0) to scrape the XML describing
> that checkpoint in order to display information.  Or another approach,
> using virDomainCheckpointGetXMLDesc(virDomainCheckpointCurrent(dom, 0), 0):
>
> $ $virsh checkpoint-current $dom | head
> <domaincheckpoint>
>    <name>check1</name>
>    <description>testing</description>
>    <creationTime>1538683344</creationTime>
>    <disks>
>      <disk name='vda' checkpoint='no'/>
>      <disk name='sdc' checkpoint='bitmap' bitmap='check1'/>
>      <disk name='sdd' checkpoint='bitmap' bitmap='check1'/>
>    </disks>
>    <domain type='kvm'>
>
> which shows the current checkpoint (that is, the checkpoint owning the
> bitmap that is still receiving live updates), and which bitmap names in
> the qcow2 files are in use. For convenience, it also recorded the full
> <domain> description at the time the checkpoint was captured (I used
> head to limit the size of this email), so that if you later hot-plug
> things, you still have a record of what state the machine had at the
> time the checkpoint was created.
>
> The XML output of a checkpoint description is normally static, but
> sometimes it is useful to know an approximate size of the guest data
> that has been dirtied since a checkpoint was created (a dynamic value
> that grows as a guest dirties more clusters).  For that, it makes sense
> to have a flag to request the dynamic data; it's also useful to have a
> flag that suppresses the (lengthy) <domain> output:
>
> $ $virsh checkpoint-current $dom --size --no-domain
> <domaincheckpoint>
>    <name>check1</name>
>    <description>testing</description>
>    <creationTime>1538683344</creationTime>
>    <disks>
>      <disk name='vda' checkpoint='no'/>
>      <disk name='sdc' checkpoint='bitmap' bitmap='check1' size='1048576'/>
>      <disk name='sdd' checkpoint='bitmap' bitmap='check1' size='65536'/>
>    </disks>
> </domaincheckpoint>
>
> This maps to virDomainCheckpointGetXMLDesc(chk,
> VIR_DOMAIN_CHECKPOINT_XML_NO_DOMAIN | VIR_DOMAIN_CHECKPOINT_XML_SIZE).
> Under the hood, libvirt calls
> {"execute":"query-block"}
> and converts the bitmap size reported by qemu into an estimate of the
> number of bytes that would be required if you were to start a backup
> from that checkpoint right now.  Note that the result is just an
> estimate of the storage taken by guest-visible data; you'll probably
> want to use 'qemu-img measure' to convert that into a size of how much a
> matching qcow2 image would require when metadata is added in; also
> remember that the number is constantly growing as the guest writes and
> causes more of the image to become dirty.  But having a feel for how
> much has changed can be useful for determining if continuing a chain of
> incremental backups still makes more sense, or if enough of the guest
> data has changed that doing a full backup is smarter; it is also useful
> for preallocating how much storage you will need for an incremental backup.
>
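
For the 'qemu-img measure' conversion mentioned above, I assume something like
this is intended (a sketch; 1048576 is the size reported for sdc in the XML
above, and the "fully allocated size" in the output accounts for qcow2
metadata on top of the data):

$ $qemu_img measure -O qcow2 --size 1048576
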
> Technically, libvirt mapping a checkpoint size request to a single
> {"execute":"query-block"} works only when querying the size of the
> current bitmap. The command also works when querying the cumulative size
> since an older checkpoint, but under the hood, libvirt must juggle
> things to create a temporary bitmap, call a few
> x-block-dirty-bitmap-merge, query the size of that temporary bitmap,
> then clean things back up again (after all, size(A) + size(B) >=
> size(A|B), depending on how many clusters were touched during both A and
> B's tracking of dirty clusters).  Again, a nice benefit of having
> libvirt manage multiple qemu bitmaps under a single libvirt API.
>
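
Presumably that juggling looks roughly like this at the QMP level (a sketch
reusing the commands already shown in this mail, assuming a chain of two
checkpoints, check1 and check2, exists by then; "tmp" is a made-up name for
the temporary bitmap, with one merge issued per checkpoint in the chain):

{"execute":"block-dirty-bitmap-add",
  "arguments":{"node":"$node1", "name":"tmp"}}
{"execute":"x-block-dirty-bitmap-merge",
  "arguments":{"node":"$node1", "src_name":"check1", "dst_name":"tmp"}}
{"execute":"x-block-dirty-bitmap-merge",
  "arguments":{"node":"$node1", "src_name":"check2", "dst_name":"tmp"}}
{"execute":"query-block"}
{"execute":"block-dirty-bitmap-remove",
  "arguments":{"node":"$node1", "name":"tmp"}}
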
> Of course, the real reason we created a checkpoint with our full backup
> is that we want to take an incremental backup next, rather than
> repeatedly taking full backups. For this, we need a one-line
> modification to our backup XML to add an <incremental> element; we also
> want to update our checkpoint XML to start yet another checkpoint when
> we run our first incremental backup.
>
> $ cat > backup.xml <<EOF
> <domainbackup mode='pull'>
>    <server transport='tcp' name='localhost' port='10809'/>
>    <incremental>check1</incremental>
>    <disks>
>      <disk name='$orig1' type='file'>
>        <scratch file='$PWD/scratch1.img'/>
>      </disk>
>      <disk name='sdd' type='file'>
>        <scratch file='$PWD/scratch2.img'/>
>      </disk>
>    </disks>
> </domainbackup>
> EOF
> $ $virsh checkpoint-create-as --print-xml $dom check2 \
>     --diskspec sdc --diskspec sdd | tee check2.xml
> <domaincheckpoint>
>    <name>check2</name>
>    <disks>
>      <disk name='sdc'/>
>      <disk name='sdd'/>
>    </disks>
> </domaincheckpoint>
> $ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img
> $ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img
>
> And again, it's time to kick off the backup job:
>
> $ $virsh backup-begin $dom backup.xml check2.xml
> Backup id 1 started
> backup used description from 'backup.xml'
> checkpoint used description from 'check2.xml'
>
> This time, the incremental backup causes libvirt to do a bit more work
> under the hood:
>
> {"execute":"nbd-server-start",
>   "arguments":{"addr":{"type":"inet",
>    "data":{"host":"localhost", "port":"10809"}}}}
> {"execute":"blockdev-add",
>   "arguments":{"driver":"qcow2", "node-name":"backup-sdc",
>    "file":{"driver":"file",
>     "filename":"$PWD/scratch1.img"},
>     "backing":"'$node1'"}}
> {"execute":"blockdev-add",
>   "arguments":{"driver":"qcow2", "node-name":"backup-sdd",
>    "file":{"driver":"file",
>     "filename":"$PWD/scratch2.img"},
>     "backing":"'$node2'"}}
> {"execute":"block-dirty-bitmap-add",
>   "arguments":{"node":"$node1", "name":"backup-sdc"}}
> {"execute":"x-block-dirty-bitmap-merge",
>   "arguments":{"node":"$node1", "src_name":"check1",
>   "dst_name":"backup-sdc"}}
> {"execute":"block-dirty-bitmap-add",
>   "arguments":{"node":"$node2", "name":"backup-sdd"}}
> {"execute":"x-block-dirty-bitmap-merge",
>   "arguments":{"node":"$node2", "src_name":"check1",
>   "dst_name":"backup-sdd"}}
> {"execute":"transaction",
>   "arguments":{"actions":[
>    {"type":"blockdev-backup", "data":{
>     "device":"$node1", "target":"backup-sdc", "sync":"none",
>     "job-id":"backup-sdc" }},
>    {"type":"blockdev-backup", "data":{
>     "device":"$node2", "target":"backup-sdd", "sync":"none",
>     "job-id":"backup-sdd" }},
>    {"type":"x-block-dirty-bitmap-disable", "data":{
>     "node":"$node1", "name":"backup-sdc"}},
>    {"type":"x-block-dirty-bitmap-disable", "data":{
>     "node":"$node2", "name":"backup-sdd"}},
>    {"type":"x-block-dirty-bitmap-disable", "data":{
>     "node":"$node1", "name":"check1"}},
>    {"type":"x-block-dirty-bitmap-disable", "data":{
>     "node":"$node2", "name":"check1"}},
>    {"type":"block-dirty-bitmap-add", "data":{
>     "node":"$node1", "name":"check2", "persistent":true}},
>    {"type":"block-dirty-bitmap-add", "data":{
>     "node":"$node2", "name":"check2", "persistent":true}}
>   ]}}
> {"execute":"nbd-server-add",
>   "arguments":{"device":"backup-sdc", "name":"sdc"}}
> {"execute":"nbd-server-add",
>   "arguments":{"device":"backup-sdd", "name":"sdd"}}
> {"execute":"x-nbd-server-add-bitmap",
>   "arguments":{"name":"sdc", "bitmap":"backup-sdc"}}
> {"execute":"x-nbd-server-add-bitmap",
>   "arguments":{"name":"sdd", "bitmap":"backup-sdd"}}
>
> Two things stand out here, different from the earlier full backup. First
> is that libvirt is now creating a temporary non-persistent bitmap,
> merging all data from check1 into the temporary, then freezing writes
> into the temporary bitmap during the transaction, and telling NBD to
> expose the bitmap to clients. The second is that since we want this
> backup to start a new checkpoint, we disable the old bitmap and create a
> new one. The two additions are independent - it is possible to create an
> incremental backup [<incremental> in backup XML] without triggering a
> new checkpoint [presence of non-null checkpoint XML].  In fact, taking
> an incremental backup without creating a checkpoint is effectively doing
> differential backups, where multiple backups started at different times
> each contain all cumulative changes since the same original point in
> time, such that later backups are larger than earlier backups, but you
> no longer have to chain those backups to one another to reconstruct the
> state in any one of the backups.
>
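
In other words, a purely differential backup would keep
<incremental>check1</incremental> in backup.xml but pass no checkpoint XML at
all, i.e. (sketch):

$ $virsh backup-begin $dom backup.xml

which maps to virDomainBackupBegin(dom, "<domainbackup ...>", NULL, 0).
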
> Now that the pull-model backup job is running, we want to scrape the
> data off the NBD server.  Merely reading nbd://localhost:10809/sdc will
> read the full contents of the disk - but that defeats the purpose of
> using the checkpoint in the first place to reduce the amount of data to
> be backed up. So, let's modify our image-scraping loop from the first
> example, to now have one client utilizing the x-dirty-bitmap command
> line extension to drive other clients.  Note: that extension is marked
> experimental in part because it has screwy semantics: if you use it, you
> can't reliably read any data from the NBD server, but instead can
> interpret 'qemu-img map' output by treating any "data":false lines as
> dirty, and "data":true entries as unchanged.
>
> $ image_opts=driver=nbd,export=sdc,server.type=inet,
> $ image_opts+=server.host=localhost,server.port=10809,
> $ image_opts+=x-dirty-bitmap=qemu:dirty-bitmap:backup-sdc
> $ $qemu_img create -f qcow2 inc12.img $size_of_orig1
> $ $qemu_img rebase -u -f qcow2 -F raw -b nbd://localhost:10809/sdc \
>    inc12.img
> $ while read line; do
>    [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.false.* ]] ||
>      continue
>    start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]}
>    qemu-io -C -c "r $start $len" -f qcow2 inc12.img
> done < <($qemu_img map --output=json --image-opts $image_opts)
> $ $qemu_img rebase -u -f qcow2 -b '' inc12.img
>
> As captured, inc12.img is an incomplete qcow2 file (it only includes
> clusters touched by the guest since the last incremental or full
> backup); but since we output into a qcow2 file, we can easily repair the
> damage:
>
> $ $qemu_img rebase -u -f qcow2 -F qcow2 -b full1.img inc12.img
>
> creating the qcow2 chain 'full1.img <- inc12.img' that contains
> identical guest-visible contents as would be present in a full backup
> done at the same moment.
>
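
To double-check the repaired chain, something like:

$ $qemu_img info --backing-chain inc12.img

should list inc12.img backed by full1.img.
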
> Of course, with the backups now captured, we clean up:
>
> $ $virsh backup-end $dom 1
> Backup id 1 completed
> $ rm scratch1.img scratch2.img
>
> and this time, virDomainBackupEnd() had to do one additional bit of work
> to delete the temporary bitmaps:
>
> {"execute":"nbd-server-remove",
>   "arguments":{"name":"sdc"}}
> {"execute":"nbd-server-remove",
>   "arguments":{"name":"sdd"}}
> {"execute":"nbd-server-stop"}
> {"execute":"block-job-cancel",
>   "arguments":{"device":"backup-sdc"}}
> {"execute":"block-job-cancel",
>   "arguments":{"device":"backup-sdd"}}
> {"execute":"blockdev-del",
>   "arguments":{"node-name":"backup-sdc"}}
> {"execute":"blockdev-del",
>   "arguments":{"node-name":"backup-sdd"}}
> {"execute":"block-dirty-bitmap-remove",
>   "arguments":{"node":"$node1", "name":"backup-sdc"}}
> {"execute":"block-dirty-bitmap-remove",
>   "arguments":{"node":"$node2", "name":"backup-sdd"}}
>
> At this point, it should be fairly obvious that you can create more
> incremental backups, by repeatedly updating the <incremental> line in
> backup.xml, and adjusting the checkpoint XML to move on to a successive
> name.  And while incremental backups are the most common (using the
> current active checkpoint as the <incremental> when starting the next),
> the scheme is also set up to permit differential backups from any
> existing checkpoint to the current point in time (since libvirt is
> already creating a temporary bitmap as its basis for the
> x-nbd-server-add-bitmap, all it has to do is just add an appropriate
> number of x-block-dirty-bitmap-merge calls to collect all bitmaps in the
> chain from the requested checkpoint to the current checkpoint).
>
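
So, if I understand correctly, a differential backup relative to check1 taken
after check2 exists would only differ by one extra merge into the temporary
bitmap before the transaction, roughly (sketch, same syntax as the sequence
above):

{"execute":"block-dirty-bitmap-add",
  "arguments":{"node":"$node1", "name":"backup-sdc"}}
{"execute":"x-block-dirty-bitmap-merge",
  "arguments":{"node":"$node1", "src_name":"check1", "dst_name":"backup-sdc"}}
{"execute":"x-block-dirty-bitmap-merge",
  "arguments":{"node":"$node1", "src_name":"check2", "dst_name":"backup-sdc"}}
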
> More to come in part 3.
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
>