<div dir="ltr"> <div class="gmail_quote"><div dir="ltr">On Fri, Oct 5, 2018 at 7:58 AM Eric Blake <<a href="mailto:eblake@redhat.com">eblake@redhat.com</a>> wrote: </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 10/4/18 12:05 AM, Eric Blake wrote: > The following (long) email describes a portion of the work-flow of how > my proposed incremental backup APIs will work, along with the backend > QMP commands that each one executes. I will reply to this thread with > further examples (the first example is long enough to be its own email). > This is an update to a thread last posted here: > <a href="https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html" rel="noreferrer" target="_blank">https://www.redhat.com/archives/libvir-list/2018-June/msg01066.html</a> > > More to come in part 2. > - Second example: a sequence of incremental backups via pull model In the first example, we did not create a checkpoint at the time of the full pull. That means we have no way to track a delta of changes since that point in time. </blockquote><div> </div><div>Why do we want to support backup without creating a checkpoint?</div><div> </div><div>If we don't have any real use case, I suggest to always require a checkpoint.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Let's repeat the full backup (reusing the same backup.xml from before), but this time, we'll add a new parameter, a second XML file for describing the checkpoint we want to create. 
Actually, it was easy enough to get virsh to write the XML for me (because it was very similar to existing code in virsh that creates XML for snapshot creation): $ $virsh checkpoint-create-as --print-xml $dom check1 testing \ --diskspec sdc --diskspec sdd | tee check1.xml <domaincheckpoint> <name>check1</name> </blockquote><div> </div><div>We should use an id, not a name, even if the name is also unique, as</div><div>in most libvirt APIs.</div><div> </div><div>In RHV we will always use a UUID for this.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <description>testing</description> <disks> <disk name='sdc'/> <disk name='sdd'/> </disks> </domaincheckpoint> I had to supply two --diskspec arguments to virsh to select just the two qcow2 disks that I am using in my example (rather than every disk in the domain, which is the default when <disks> is not present). </blockquote><div> </div><div>So is <disks /> a valid configuration that selects all disks, or is it omitting the "disks" element</div><div>that selects all disks?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I also picked a name (mandatory) and description (optional) to be associated with the checkpoint. The backup.xml file that we plan to reuse still mentions scratch1.img and scratch2.img as files needed for staging the pull request. However, any contents in those files could interfere with our second backup (after all, every cluster written into that file from the first backup represents a point in time that was frozen at the first backup; but our second backup will want to read the data as the guest sees it now rather than what it was at the first backup), so we MUST regenerate the scratch files. (Perhaps I should have just deleted them at the end of example 1 in my previous email, had I remembered when typing that mail). 
$ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img $ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img Now it is time to begin the full backup and create a checkpoint at the same time. Also, this time around, it would be nice if the guest had a chance to freeze I/O to the disks prior to the point chosen as the checkpoint. Assuming the guest is trusted, and running the qemu guest agent (qga), we can do that with: $ $virsh fsfreeze $dom $ $virsh backup-begin $dom backup.xml check1.xml Backup id 1 started backup used description from 'backup.xml' checkpoint used description from 'check1.xml' $ $virsh fsthaw $dom </blockquote><div> </div><div>Great, this answers my (unsent) question about freeze/thaw from part 1 :-) </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> and eventually, we may decide to add a VIR_DOMAIN_BACKUP_BEGIN_QUIESCE flag to combine those three steps into a single API (matching what we've done on some other existing APIs). In other words, the sequence of QMP operations performed during virDomainBackupBegin is quick enough that it won't stall a freeze operation (at least Windows is picky if you stall a freeze operation longer than 10 seconds). </blockquote><div> </div><div>We use fsFreeze/fsThaw directly in RHV since we need to support external</div><div>snapshots (e.g. ceph), so we don't need this functionality, but it sounds like a good</div><div>idea to make it work like snapshots.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> The tweaked $virsh backup-begin now results in a call to: virDomainBackupBegin(dom, "<domainbackup ...>", "<domaincheckpoint ...>", 0) and in turn libvirt makes a similar sequence of QMP calls as before, with a slight modification in the middle: {"execute":"nbd-server-start",... {"execute":"blockdev-add",... 
</blockquote><div> </div><div>This does not work yet for network disks like "rbd" and "glusterfs".</div><div>Does it mean that they will not be supported for backup?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> {"execute":"transaction", "arguments":{"actions":[ {"type":"blockdev-backup", "data":{ "device":"$node1", "target":"backup-sdc", "sync":"none", "job-id":"backup-sdc" }}, {"type":"blockdev-backup", "data":{ "device":"$node2", "target":"backup-sdd", "sync":"none", "job-id":"backup-sdd" }}, {"type":"block-dirty-bitmap-add", "data":{ "node":"$node1", "name":"check1", "persistent":true}}, {"type":"block-dirty-bitmap-add", "data":{ "node":"$node2", "name":"check1", "persistent":true}} ]}} {"execute":"nbd-server-add",... </blockquote><div> </div><div> </div><div>What if this sequence fails in the middle? Will libvirt handle all failures</div><div>and roll back to the previous state?</div><div> </div><div>What are the semantics of "execute": "transaction"? Does it mean that qemu</div><div>will handle all possible failures in one of the actions?</div><div> </div><div>(Will continue later)</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> The only change was adding more actions to the "transaction" command - in addition to kicking off the fleece image in the scratch nodes, it ALSO added a persistent bitmap to each of the original images, to track all changes made after the point of the transaction. The bitmaps are persistent - at this point (well, it's better if you wait until after backup-end), you could shut the guest down and restart it, and libvirt will still remember that the checkpoint exists, and qemu will continue to track guest writes via the bitmap. However, the backup job itself is currently live-only, and shutting down the guest while a backup operation is in effect will lose track of the backup job. 
What that really means is that if the guest shuts down, your current backup job is hosed (you cannot ever get back the point-in-time data from your API request - as your next API request will be a new point in time) - but you have not permanently ruined the guest, and your recovery is to just start a new backup. Pulling the data out from the backup is unchanged from example 1; virsh backup-dumpxml will show details about the job (yes, the job id is still 1 for now), and when ready, virsh backup-end will end the job and gracefully take down the NBD server with no difference in QMP commands from before. Thus, the creation of a checkpoint didn't change any of the fundamentals of capturing the current backup, but rather is in preparation for the next step. $ $virsh backup-end $dom 1 Backup id 1 completed $ rm scratch1.img scratch2.img [We have not yet designed how qemu bitmaps will interact with external snapshots - but I see two likely scenarios: 1. Down the road, I add a virDomainSnapshotCheckpointCreateXML() API, which adds a checkpointXML parameter but otherwise behaves like the existing virDomainSnapshotCreateXML - if that API is added in a different release than my current API proposals, that's yet another libvirt.so rebase to pick up the new API. 2. My current proposal of virDomainBackupBegin(dom, "<domainbackup>", "<domaincheckpoint>", flags) could instead be tweaked to a single XML parameter, virDomainBackupBegin(dom, " <domainbackup> <domaincheckpoint> ... </domaincheckpoint> </domainbackup>", flags) prior to adding my APIs to libvirt 4.9, then down the road, we also tweak <domainsnapshot> to take an optional <domaincheckpoint> sub-element, and thus reuse the existing virDomainSnapshotCreateXML() to now also create checkpoints without a further API addition. Speak up now if you have a preference between the two ideas] Now that we have concluded the full backup and created a checkpoint, we can do more things with the checkpoint (it is persistent, after all). 
For example: $ $virsh checkpoint-list $dom Name Creation Time -------------------------------------------- check1 2018-10-04 15:02:24 -0500 called virDomainListCheckpoints(dom, &array, 0) under the hood to get a list of virDomainCheckpointPtr objects, then called virDomainCheckpointGetXMLDesc(array[0], 0) to scrape the XML describing that checkpoint in order to display information. Or another approach, using virDomainCheckpointGetXMLDesc(virDomainCheckpointCurrent(dom, 0), 0): $ $virsh checkpoint-current $dom | head <domaincheckpoint> <name>check1</name> <description>testing</description> <creationTime>1538683344</creationTime> <disks> <disk name='vda' checkpoint='no'/> <disk name='sdc' checkpoint='bitmap' bitmap='check1'/> <disk name='sdd' checkpoint='bitmap' bitmap='check1'/> </disks> <domain type='kvm'> which shows the current checkpoint (that is, the checkpoint owning the bitmap that is still receiving live updates), and which bitmap names in the qcow2 files are in use. For convenience, it also recorded the full <domain> description at the time the checkpoint was captured (I used head to limit the size of this email), so that if you later hot-plug things, you still have a record of what state the machine had at the time the checkpoint was created. The XML output of a checkpoint description is normally static, but sometimes it is useful to know an approximate size of the guest data that has been dirtied since a checkpoint was created (a dynamic value that grows as a guest dirties more clusters). 
For that, it makes sense to have a flag to request the dynamic data; it's also useful to have a flag that suppresses the (lengthy) <domain> output: $ $virsh checkpoint-current $dom --size --no-domain <domaincheckpoint> <name>check1</name> <description>testing</description> <creationTime>1538683344</creationTime> <disks> <disk name='vda' checkpoint='no'/> <disk name='sdc' checkpoint='bitmap' bitmap='check1' size='1048576'/> <disk name='sdd' checkpoint='bitmap' bitmap='check1' size='65536'/> </disks> </domaincheckpoint> This maps to virDomainCheckpointGetXMLDesc(chk, VIR_DOMAIN_CHECKPOINT_XML_NO_DOMAIN | VIR_DOMAIN_CHECKPOINT_XML_SIZE). Under the hood, libvirt calls {"execute":"query-block"} and converts the bitmap size reported by qemu into an estimate of the number of bytes that would be required if you were to start a backup from that checkpoint right now. Note that the result is just an estimate of the storage taken by guest-visible data; you'll probably want to use 'qemu-img measure' to convert that into a size of how much a matching qcow2 image would require when metadata is added in; also remember that the number is constantly growing as the guest writes and causes more of the image to become dirty. But having a feel for how much has changed can be useful for determining if continuing a chain of incremental backups still makes more sense, or if enough of the guest data has changed that doing a full backup is smarter; it is also useful for preallocating how much storage you will need for an incremental backup. Technically, libvirt mapping a checkpoint size request to a single {"execute":"query-block"} works only when querying the size of the current bitmap. 
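As a rough illustration of how a management application might use the size= estimates shown above to choose between an incremental and a full backup, here is a minimal Python sketch. The function name choose_backup_kind, the 50% threshold, and the 1 GiB disk size are all assumptions made for this example; nothing here is part of libvirt or qemu, and a real policy would likely weigh more factors.

```python
def choose_backup_kind(dirty_bytes, disk_bytes, threshold=0.5):
    """Return 'full' once the dirty fraction reaches the threshold, else 'incremental'.

    Hypothetical policy helper: the threshold is an arbitrary assumption.
    """
    if not disk_bytes > 0:
        raise ValueError("disk_bytes must be positive")
    return "full" if dirty_bytes / disk_bytes >= threshold else "incremental"


# size= values copied from the checkpoint XML above; the 1 GiB disk size is invented.
dirty_sizes = {"sdc": 1048576, "sdd": 65536}
for disk, dirty in dirty_sizes.items():
    print(disk, choose_backup_kind(dirty, 1024 ** 3))
```

Because the reported size keeps growing while the guest writes, any such decision is only a snapshot of the situation at query time.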
The command also works when querying the cumulative size since an older checkpoint, but under the hood, libvirt must juggle things to create a temporary bitmap, call x-block-dirty-bitmap-merge a few times, query the size of that temporary bitmap, then clean things back up again (after all, size(A) + size(B) >= size(A|B), depending on how many clusters were touched during both A and B's tracking of dirty clusters). Again, a nice benefit of having libvirt manage multiple qemu bitmaps under a single libvirt API. Of course, the real reason we created a checkpoint with our full backup is that we want to take an incremental backup next, rather than repeatedly taking full backups. For this, we need a one-line modification to our backup XML to add an <incremental> element; we also want to update our checkpoint XML to start yet another checkpoint when we run our first incremental backup. $ cat > backup.xml <<EOF <domainbackup mode='pull'> <server transport='tcp' name='localhost' port='10809'/> <incremental>check1</incremental> <disks> <disk name='$orig1' type='file'> <scratch file='$PWD/scratch1.img'/> </disk> <disk name='sdd' type='file'> <scratch file='$PWD/scratch2.img'/> </disk> </disks> </domainbackup> EOF $ $virsh checkpoint-create-as --print-xml $dom check2 \ --diskspec sdc --diskspec sdd | tee check2.xml <domaincheckpoint> <name>check2</name> <disks> <disk name='sdc'/> <disk name='sdd'/> </disks> </domaincheckpoint> $ $qemu_img create -f qcow2 -b $orig1 -F qcow2 scratch1.img $ $qemu_img create -f qcow2 -b $orig2 -F qcow2 scratch2.img And again, it's time to kick off the backup job: $ $virsh backup-begin $dom backup.xml check2.xml Backup id 1 started backup used description from 'backup.xml' checkpoint used description from 'check2.xml' This time, the incremental backup causes libvirt to do a bit more work under the hood: {"execute":"nbd-server-start", "arguments":{"addr":{"type":"inet", "data":{"host":"localhost", "port":"10809"}}}} {"execute":"blockdev-add", 
"arguments":{"driver":"qcow2", "node-name":"backup-sdc", "file":{"driver":"file", "filename":"$PWD/scratch1.img"}, "backing":"$node1"}} {"execute":"blockdev-add", "arguments":{"driver":"qcow2", "node-name":"backup-sdd", "file":{"driver":"file", "filename":"$PWD/scratch2.img"}, "backing":"$node2"}} {"execute":"block-dirty-bitmap-add", "arguments":{"node":"$node1", "name":"backup-sdc"}} {"execute":"x-block-dirty-bitmap-merge", "arguments":{"node":"$node1", "src_name":"check1", "dst_name":"backup-sdc"}} {"execute":"block-dirty-bitmap-add", "arguments":{"node":"$node2", "name":"backup-sdd"}} {"execute":"x-block-dirty-bitmap-merge", "arguments":{"node":"$node2", "src_name":"check1", "dst_name":"backup-sdd"}} {"execute":"transaction", "arguments":{"actions":[ {"type":"blockdev-backup", "data":{ "device":"$node1", "target":"backup-sdc", "sync":"none", "job-id":"backup-sdc" }}, {"type":"blockdev-backup", "data":{ "device":"$node2", "target":"backup-sdd", "sync":"none", "job-id":"backup-sdd" }}, {"type":"x-block-dirty-bitmap-disable", "data":{ "node":"$node1", "name":"backup-sdc"}}, {"type":"x-block-dirty-bitmap-disable", "data":{ "node":"$node2", "name":"backup-sdd"}}, {"type":"x-block-dirty-bitmap-disable", "data":{ "node":"$node1", "name":"check1"}}, {"type":"x-block-dirty-bitmap-disable", "data":{ "node":"$node2", "name":"check1"}}, {"type":"block-dirty-bitmap-add", "data":{ "node":"$node1", "name":"check2", "persistent":true}}, {"type":"block-dirty-bitmap-add", "data":{ "node":"$node2", "name":"check2", "persistent":true}} ]}} {"execute":"nbd-server-add", "arguments":{"device":"backup-sdc", "name":"sdc"}} {"execute":"nbd-server-add", "arguments":{"device":"backup-sdd", "name":"sdd"}} {"execute":"x-nbd-server-add-bitmap", "arguments":{"name":"sdc", "bitmap":"backup-sdc"}} {"execute":"x-nbd-server-add-bitmap", "arguments":{"name":"sdd", "bitmap":"backup-sdd"}} Two things stand out here, different from the earlier full backup. 
First is that libvirt is now creating a temporary non-persistent bitmap, merging all data from check1 into the temporary, then freezing writes into the temporary bitmap during the transaction, and telling NBD to expose the bitmap to clients. The second is that since we want this backup to start a new checkpoint, we disable the old bitmap and create a new one. The two additions are independent - it is possible to create an incremental backup [<incremental> in backup XML] without triggering a new checkpoint [presence of non-null checkpoint XML]. In fact, taking an incremental backup without creating a checkpoint is effectively doing differential backups, where multiple backups started at different times each contain all cumulative changes since the same original point in time, such that later backups are larger than earlier backups, but you no longer have to chain those backups to one another to reconstruct the state in any one of the backups. Now that the pull-model backup job is running, we want to scrape the data off the NBD server. Merely reading nbd://localhost:10809/sdc will read the full contents of the disk - but that defeats the purpose of using the checkpoint in the first place to reduce the amount of data to be backed up. So, let's modify our image-scraping loop from the first example, to now have one client utilizing the x-dirty-bitmap command line extension to drive other clients. Note: that extension is marked experimental in part because it has screwy semantics: if you use it, you can't reliably read any data from the NBD server, but instead can interpret 'qemu-img map' output by treating any "data":false lines as dirty, and "data":true entries as unchanged. 
$ image_opts=driver=nbd,export=sdc,server.type=inet, $ image_opts+=server.host=localhost,server.port=10809, $ image_opts+=x-dirty-bitmap=qemu:dirty-bitmap:backup-sdc $ $qemu_img create -f qcow2 inc12.img $size_of_orig1 $ $qemu_img rebase -u -f qcow2 -F raw -b nbd://localhost:10809/sdc \ inc12.img $ while read line; do [[ $line =~ .*start.:.([0-9]*).*length.:.([0-9]*).*data.:.false.* ]] || continue start=${BASH_REMATCH[1]} len=${BASH_REMATCH[2]} qemu-io -C -c "r $start $len" -f qcow2 inc12.img done < <($qemu_img map --output=json --image-opts $image_opts) $ $qemu_img rebase -u -f qcow2 -b '' inc12.img As captured, inc12.img is an incomplete qcow2 file (it only includes clusters touched by the guest since the last incremental or full backup); but since we output into a qcow2 file, we can easily repair the damage: $ $qemu_img rebase -u -f qcow2 -F qcow2 -b full1.img inc12.img creating the qcow2 chain 'full1.img <- inc12.img' that contains identical guest-visible contents as would be present in a full backup done at the same moment. 
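The extent selection done by the scraping loop above can also be sketched in Python, parsing the JSON from 'qemu-img map --output=json' rather than matching it with a regular expression. This is only a sketch under stated assumptions: the helper name dirty_extents is made up for illustration, and the sample input is hand-written rather than captured from a real run (real output carries additional keys).

```python
import json


def dirty_extents(map_json):
    """Return (start, length) pairs for extents reported with "data": false.

    With the x-dirty-bitmap extension the meaning is inverted:
    "data": false marks a dirty extent, "data": true an unchanged one.
    """
    extents = []
    for entry in json.loads(map_json):
        if entry["data"] is False:
            extents.append((entry["start"], entry["length"]))
    return extents


# Hypothetical sample input, not taken from a real qemu-img run.
sample = """[
 {"start": 0, "length": 65536, "data": true},
 {"start": 65536, "length": 131072, "data": false},
 {"start": 196608, "length": 65536, "data": true}
]"""

for start, length in dirty_extents(sample):
    # Each pair corresponds to one 'r $start $len' read in the loop above.
    print(f"r {start} {length}")
```

Each printed line mirrors the argument the shell loop passes to qemu-io -c, so only the dirty extents are copied into inc12.img.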
Of course, with the backups now captured, we clean up: $ $virsh backup-end $dom 1 Backup id 1 completed $ rm scratch1.img scratch2.img and this time, virDomainBackupEnd() had to do one additional bit of work to delete the temporary bitmaps: {"execute":"nbd-server-remove", "arguments":{"name":"sdc"}} {"execute":"nbd-server-remove", "arguments":{"name":"sdd"}} {"execute":"nbd-server-stop"} {"execute":"block-job-cancel", "arguments":{"device":"backup-sdc"}} {"execute":"block-job-cancel", "arguments":{"device":"backup-sdd"}} {"execute":"blockdev-del", "arguments":{"node-name":"backup-sdc"}} {"execute":"blockdev-del", "arguments":{"node-name":"backup-sdd"}} {"execute":"block-dirty-bitmap-remove", "arguments":{"node":"$node1", "name":"backup-sdc"}} {"execute":"block-dirty-bitmap-remove", "arguments":{"node":"$node2", "name":"backup-sdd"}} At this point, it should be fairly obvious that you can create more incremental backups, by repeatedly updating the <incremental> line in backup.xml, and adjusting the checkpoint XML to move on to a successive name. And while incremental backups are the most common (using the current active checkpoint as the <incremental> when starting the next), the scheme is also set up to permit differential backups from any existing checkpoint to the current point in time (since libvirt is already creating a temporary bitmap as its basis for the x-nbd-server-add-bitmap, all it has to do is just add an appropriate number of x-block-dirty-bitmap-merge calls to collect all bitmaps in the chain from the requested checkpoint to the current checkpoint). More to come in part 3. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: <a href="http://qemu.org" rel="noreferrer" target="_blank">qemu.org</a> | <a href="http://libvirt.org" rel="noreferrer" target="_blank">libvirt.org</a> </blockquote></div></div>