[libvirt] [PATCH 1/8] snapshots: Avoid term 'checkpoint' for full system snapshot

Eric Blake eblake at redhat.com
Wed Jun 27 01:27:55 UTC 2018


On 06/26/2018 10:56 AM, Nir Soffer wrote:
> On Wed, Jun 13, 2018 at 7:42 PM Eric Blake <eblake at redhat.com> wrote:
> 
>> Upcoming patches plan to introduce virDomainCheckpointPtr as a new
>> object for use in incremental backups, along with documentation of
>> how incremental backups differ from snapshots.  But first, we need
>> to rename any existing mention of a 'system checkpoint' to instead
>> be a 'full system state snapshot', so that we aren't overloading
>> the term checkpoint.
>>
> 
> I want to refer only to the new concept of a checkpoint, as compared
> with a snapshot.
> 
> I think checkpoint should refer to the current snapshot. When you perform
> a backup, you should get the changed blocks in the current snapshot.

That is an incremental backup (copying only the blocks that have changed 
since some previous point in time) - and my design was that such points 
in time are named 'checkpoints', where the most recent checkpoint is the 
current checkpoint.  This is different from a snapshot (which captures 
enough state that you can revert directly to that point in time) - a 
checkpoint only carries enough information to perform an incremental 
backup, not a rollback to an earlier state.
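
To make the distinction concrete, a rough qemu-level mapping (the 
details come later in this message):

checkpoint -> a dirty bitmap in the qcow2 image (metadata only; tracks
              which clusters changed)
snapshot   -> a new overlay file in the backing chain (full state; a
              rollback point)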

> When you restore, you want to restore several complete snapshots,
> and one partial snapshot, based on the backups of that snapshot.

I'm worried that overloading the term "snapshot" and/or "checkpoint" can 
make it difficult to see whether we are describing the same data motions.

You are correct that rolling a virtual machine back to the state 
represented by a series of incremental backups requires reconstructing 
the state present on the machine at the desired point in time.  But I'll 
have to read your example first to see if we're on the same page.

> 
> Let's try to see an example:
> 
> T1
> - user create new vm marked for incremental backup
> - system create base volume (S1)
> - system create new dirty bitmap (B1)

Why do you need a dirty bitmap on a brand new system?  By definition, if 
the VM is brand new, every sector that the guest touches will be part of 
the first incremental backup, which is no different from taking a full 
backup of every sector.  But if it makes life easier by following 
consistent patterns, I also don't see a problem with creating a first 
checkpoint at the time an image is first created (my API proposal would 
allow you to create a domain, start it in the paused state, create a 
checkpoint, and then resume the guest so that it can start executing).
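
As a rough sketch of what that first checkpoint could look like at the 
QMP level (the node name 'drive0' and bitmap name 'B1' are made up for 
illustration):

{"execute": "block-dirty-bitmap-add",
 "arguments": {"node": "drive0", "name": "B1", "persistent": true}}

The "persistent" flag asks qemu to write the bitmap into the qcow2 file 
so that it survives a guest shutdown.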

> 
> T2
> - user create a snapshot
> - dirty bitmap in original snapshot deactivated (B1)
> - system create new snapshot (S2)
> - system starts new dirty bitmap in the new snapshot (B2)

I'm still worried that interactions between snapshots (where the backing 
chain grows) and bitmaps may present interesting challenges.  But what 
you are describing here is that the act of creating a snapshot (to 
enlarge the backing chain) also has the effect of creating a checkpoint 
(a new point in time for tracking incremental changes since the creation 
of the snapshot).  Whether we have to copy B1 into image S2, or whether 
image S2 can get by with just bitmap B2, is an implementation detail.
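
At the QMP level, that pairing might look something like this 
transaction (file and node names are illustrative; this is a sketch, not 
the exact implementation):

{"execute": "transaction", "arguments": {"actions": [
  {"type": "blockdev-snapshot-sync",
   "data": {"device": "drive0", "snapshot-file": "S2.qcow2",
            "format": "qcow2"}},
  {"type": "block-dirty-bitmap-add",
   "data": {"node": "drive0", "name": "B2", "persistent": true}}
]}}

so that B2 starts at exactly the same point in time where S1 becomes 
read-only.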

> 
> T3
> - user create new checkpoint
> - system deactivate current dirty bitmap (B2)
> - system create new dirty bitmap (B3)
> - user backups data in snapshot S2 using dirty bitmap B2
> - user backups data in snapshot S1 using dirty bitmap B1

So here you are performing two incremental backups.  Note: the user can 
already back up S1 without using any new APIs, and without reference to 
bitmap B1 - that's because B1 was started when S1 was created, and 
closed out when S1 was no longer modified - and now that S1 is a 
read-only file in the backing chain, copying S1 is the same as copying 
the clusters covered by bitmap B1.

Also, my current API additions do NOT make it easy to grab just the 
incremental data covered by bitmap B1 at time T3; rather, the time to 
grab the copy of the data covered just by B1 is at time T2 when you 
create bitmap B2 (whether or not you also create file S2).  The API 
additions as I have proposed them only make it easy to grab a full 
backup of all data up to time T3 (no checkpoint as its start), an 
incremental backup of all data since T1 (checkpoint T1 as its start, 
using the merge of B1 and B2 to learn which clusters to grab), or an 
incremental backup of all data since T2 (checkpoint T2 as its start, 
using B2 to learn which clusters to grab).
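
For concreteness, the XML I have proposed for virDomainBackupBegin() 
expresses the incremental cases with the optional <incremental> element 
naming the starting checkpoint; roughly (element names per my proposal, 
still subject to change):

<domainbackup>
  <incremental>T2</incremental>
</domainbackup>

while omitting <incremental> requests the full backup.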

If you NEED to grab an incremental backup whose end boundary is NOT the 
current moment in time, then we need to rethink the operations we are 
offering via my new API.  On the bright side, since my API for 
virDomainBackupBegin() takes an XML description, we DO have the option 
of enhancing that XML to take a second point in time as the end boundary 
(it already has an optional <incremental> tag naming the first point in 
time for the start boundary, with a full backup performed if that tag is 
omitted) - if we enhance that XML, we'd also have to figure out how to 
map it to the operations that qemu exposes.  (The blockdev-backup 
command makes it easy to grab an incremental backup ending at the 
current moment in time, by using the "sync":"none" option to a temporary 
scratch file so that further guest writes do not corrupt the data to be 
grabbed from that point in time - but it does NOT make it easy to see 
the state of data from an earlier point in time - I'll demonstrate that 
below.)
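
As a sketch of that qemu-side command (assuming the scratch file has 
already been added as a node named 'scratch0'; both node names are 
illustrative):

{"execute": "blockdev-backup",
 "arguments": {"device": "drive0", "target": "scratch0",
               "sync": "none"}}

With "sync":"none", nothing is copied up front; instead, old cluster 
contents are copied into scratch0 just before the guest overwrites 
them, so further guest writes do not destroy the point-in-time data.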

> 
> T4
> - user create new checkpoint
> - system deactivate current dirty bitmap (B3)
> - system create new dirty bitmap (B4)
> - user backups data in snapshot S2 using dirty bitmap B3

Yes, this is similar to what was done at T3, without the complication of 
trying to grab an incremental backup whose end boundary is not the 
current moment in time.

> 
> Let's say the user wants to restore to the state as it was at T3
> 
> This is the data kept by the backup application:
> 
> - snapshots
>    - S1
>      - checkpoints
>        - B1
>    - S2
>      - checkpoints
>        - B2
>        - B3
> 
> T5
> - user start restore to state in time T3
> - user create new disk
> - user create empty snapshot S1
> - user upload snapshot S1 data to storage
> - user create empty snapshot disk S2
> - user upload snapshot S1 data to storage

Presumably, that last step would be 'user uploads snapshot S2 data to 
storage', not S1.  But restoring in this manner didn't make any use of 
your incremental backups.

Maybe what I need to do is give a more visual indication of what 
incremental backups store.

At T1, we create S1 and start populating it.  As this was a brand new 
guest, the storage starts empty.  Since you mentioned B1, I'll show it 
here, even though I argued above that it is pointless except for keeping 
the pattern consistent with the later steps:

S1: |--------|
B1: |--------|
guest sees: |--------|

At T2, the guest has written things, so we now have:

S1: |AAAA----|
B1: |XXXX----|
guest sees: |AAAA----|

where A is the data the guest has written, and X is an indication in 
the bitmap of which sections are dirty.

Also at time T2, we create a snapshot S2, making S1 a read-only picture 
of the state of the disk at T2; we also start bitmap B2 on S2 to track 
what the guest does:

S1: |AAAA----| <- S2: |--------|
B1: |XXXX----|    B2: |--------|

We can copy S1 to S1.bak at any point in time now that S1 is read-only.

S1.bak: |AAAA----|

At T3, the guest has written things, so we now have:

S1: |AAAA----| <- S2: |---BBB--|
B1: |XXXX----|    B2: |---XXX--|
guest sees: |AAABBB--|

so at this point, we freeze B2 and create B3; the new 
virDomainBackupBegin() API will also let us access the following copies 
at this time:

S1: |AAAA----| <- S2: |---BBB--|
B1: |XXXX----|    B2: |---XXX--|
                   B3: |--------|

full3.bak (no checkpoint as starting point): |AAABBB--|
S2.bak (checkpoint B2 as starting point): |---BBB--|

S2.bak by itself does not match anything the guest ever saw, but you can 
string together:

S1.bak <- S2.bak

to reconstruct the state the guest saw at T3.  By T4, the guest has made 
more edits:

S1: |AAAA----| <- S2: |D--BBDD-|
B1: |XXXX----|    B2: |---XXX--|
                   B3: |X----XX-|
guest sees: |DAABBDD-|

and as before, we now create B4, and have the option of several backups 
(usually, you'll only grab the most recent incremental backup, and not 
multiple backups; this is more an exploration of what is possible):

full4.bak (no checkpoint as starting): |DAABBDD-|
S2_3.bak (B2 as starting point, covering merge of B2 and B3): |D--BBDD-|
S3.bak (B3 as starting point): |D----DD-|

Note that both (S1.bak <- S2_3.bak) and (S1.bak <- S2.bak <- S3.bak) 
result in the same reconstructed guest image at time T4.  Also note that 
reading the data covered by bitmap B2 in isolation at this time is NOT 
usable (you'd get |---BBD--|, which mixes the incremental difference 
from T2 to T3 with a subset of the difference from T3 to T4, so it NO 
LONGER REPRESENTS the state of the guest at either T2 or T3, even when 
used as an overlay on top of S1.bak).  Hence my emphasis that it is 
usually important to create your incremental backup at the same time you 
start your next bitmap, rather than trying to do it after the fact.
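
In qemu terms, that pairing is naturally expressed as a single 
transaction, so the next bitmap starts at exactly the point in time the 
backup captures.  A sketch, borrowing the drive-backup form that qemu's 
bitmap documentation uses (names illustrative):

{"execute": "transaction", "arguments": {"actions": [
  {"type": "block-dirty-bitmap-add",
   "data": {"node": "drive0", "name": "B4", "persistent": true}},
  {"type": "drive-backup",
   "data": {"device": "drive0", "target": "S3.bak",
            "format": "qcow2", "sync": "incremental",
            "bitmap": "B3"}}
]}}

If any action fails, the whole transaction fails, so you never end up 
with a new bitmap but no matching backup.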

Also, you are starting to see the benefits of incremental backups. 
Creating S2_3.bak doesn't necessarily need bitmaps (it results in the 
same image as you would get if you create a temporary overlay [S1 <- S2 
<- tmp], copy off S2, then live merge tmp back into S2), but both 
full4.bak and S2_3.bak had to copy more data than S3.bak.

Later on, if you want to roll back to what the guest saw at T4, you just 
have to restore [S1.bak <- S2.bak <- S3.bak] as your backing chain to 
provide the data the guest saw at that time.


> 
> John, are dirty bitmaps implemented in this way in qemu?

The whole point of the libvirt API proposals is to make it possible to 
create bitmaps in qcow2 images at the point where you are creating 
incremental backups, so that the next incremental backup can be created 
using the previous one as its base.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



