Summary on new backup interfaces in QEMU

Vladimir Sementsov-Ogievskiy v.sementsov-og at ya.ru
Tue Mar 15 17:57:01 UTC 2022


Hi all!

Here I want to summarize new interfaces and use cases for backup in QEMU.

TODO for me: convert this into good rst documentation in docs/.

OK, let's begin.

First, note that the drive-backup QMP command is deprecated; use blockdev-backup instead.

Next, some terminology:

push backup: the whole process runs inside the QEMU process; may also be called "internal backup"

pull backup: QEMU only exports a kind of snapshot (for example, over NBD), and third-party software reads this export and stores it somewhere; also called "external backup"

copy-before-write operations: we usually back up an active disk, i.e. the guest is running and may write to the disk while the backup is in progress. When the guest wants to rewrite a data region that has not been backed up yet, we must pause that guest write and copy the original data somewhere else before allowing the guest write to continue. That is a copy-before-write operation.

image-fleecing: the technique that allows exporting a "snapshotted" state of the active disk with the help of copy-before-write operations. We create a temporary image as the target for copy-before-write operations, and provide an interface for the user to read the "snapshotted" state: data that has already been changed in the original active disk is read from the temporary image, while unchanged data is read directly from the active disk. The temporary image itself is also called a "reverse delta" or "reversed delta".



== Simple push backup ==

Just use blockdev-backup; nothing new here. I'll only note some technical details that are relatively new (see the sketch after this list):

1. First, the backup job inserts a copy-before-write filter above the source disk, to do copy-before-write operations.
2. The created copy-before-write filter shares its internal block-copy state with the backup job, so they work in collaboration and don't copy the same data twice.
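
A minimal sketch in QMP shorthand (the names disk0, target0 and backup0 are hypothetical; the target node is assumed to have been attached with blockdev-add beforehand):

qmp: blockdev-backup {job-id: backup0, device: disk0, target: target0, sync: full}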



== Full pull backup ==

Assume we are going to do incremental backups in the future, so we also need to create a dirty bitmap, to track the dirtiness of the active disk since the full backup.

1. Create an empty temporary image for fleecing. It must be of the same size as the active disk. It does not have to be qcow2, and if it is qcow2, you should not make the original active disk a backing file of the new temporary image (that was necessary in the old fleecing scheme).

Example:
   qemu-img create -f qcow2 temp.qcow2 64G


2. Initialize the fleecing scheme and create a dirty bitmap for future incremental backups.

Assume disk0 is the active disk to be backed up, attached to the qdev with id sda.

qmp: transaction [
    block-dirty-bitmap-add {node: disk0, name: bitmap0, persistent: true}
    blockdev-add* {node-name: tmp-protocol, driver: file, filename: temp.qcow2}
    blockdev-add {node-name: tmp, driver: qcow2, file: tmp-protocol}
    blockdev-add {node-name: cbw, driver: copy-before-write, file: disk0, target: tmp}
    blockdev-replace** {parent-type: qdev, qdev-id: sda, new-child: cbw}
    blockdev-add {node-name: acc, driver: snapshot-access, file: cbw}
]
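
For reference, the shorthand above maps onto raw QMP wire format by wrapping each command as a transaction action. A sketch of the first two actions (recall that blockdev-add and blockdev-replace as transaction actions depend on the series referenced below):

{"execute": "transaction",
 "arguments": {"actions": [
     {"type": "block-dirty-bitmap-add",
      "data": {"node": "disk0", "name": "bitmap0", "persistent": true}},
     {"type": "blockdev-add",
      "data": {"node-name": "tmp-protocol", "driver": "file",
               "filename": "temp.qcow2"}},
     ...
 ]}}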

qmp: nbd-server-start {...}
qmp: nbd-server-add {device: acc, ...}
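
For example (the Unix socket path and export name here are hypothetical; writable: true is what allows the client to send NBD_CMD_TRIM, used in step 3):

qmp: nbd-server-start {addr: {type: unix, data: {path: /tmp/fleecing.sock}}}
qmp: nbd-server-add {device: acc, name: fleece, writable: true}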

This way we create the following block-graph:

[guest]                   [NBD export]
    |                            |
    | root                       | root
    v                 file       v
[copy-before-write]<------[snapshot-access]
    |           |
    | file      | target
    v           v
[active-disk] [temp.qcow2]

* "[PATCH 0/2] blockdev-add transaction" series needed for this
** "[PATCH v3 00/11] blockdev-replace" series needed for this


Note additional useful options for the copy-before-write filter (see the combined example after these notes):

"[PATCH 0/3] block: copy-before-write: on-cbw-error behavior" provides possibility of option on-cbw-error=break-snapshot, which means that on failure of CBW operation we will not break guest write, but instead all further reads by NBD client will fail, which formally means: break the backup process, not guest write.

"[PATCH 0/4] block: copy-before-write: cbw-timeout" provides an option cbw-timeout, to set a timeout for CBW operations. That's very useful to avoid guest stuck.


3. Now a third-party backup tool can read data from the NBD export.
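
For example, the third-party tool may be qemu-img itself (the socket path and export name are hypothetical and must match the nbd-server-start/nbd-server-add calls above):

   qemu-img convert -O qcow2 \
       'nbd+unix:///fleece?socket=/tmp/fleecing.sock' backup-full.qcow2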

The NBD_CMD_TRIM (discard) operation is supported on the export; it has the following effects:

1. the data is discarded from the temporary image, if it is stored there
2. further copy-before-write operations for that region are avoided (the guest is free to rewrite the corresponding data with no extra latency)
3. all further NBD client reads of the discarded areas will fail

So the NBD client may discard regions that are already backed up, to avoid extra latency for guest writes and to free disk space on the host.
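
For instance, with libnbd's nbdsh a client could discard the first 64 KiB once it has copied that region (same hypothetical socket and export name as above):

   nbdsh -u 'nbd+unix:///fleece?socket=/tmp/fleecing.sock' -c 'h.trim(65536, 0)'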

A possible TODO here is an NBD protocol extension that allows READ & DISCARD in one command. That would avoid an extra command on the wire, but we would lose the possibility of retrying a READ operation that failed.

4. After the backup is complete, we should destroy the fleecing scheme:

qmp: nbd-server-stop

qmp: blockdev-del {node-name: acc}
qmp: blockdev-replace {parent-type: qdev, qdev-id: sda, new-child: disk0}
qmp: blockdev-del {node-name: cbw}
qmp: blockdev-del {node-name: tmp}
qmp: blockdev-del {node-name: tmp-protocol}


5. If the backup failed, we should remove the created dirty bitmap:

qmp: block-dirty-bitmap-remove {node: disk0, name: bitmap0}



== Incremental pull backup ==

OK, now we have a bitmap called bitmap0 and want to do an incremental backup according to that bitmap. In short, we want to:

  - create a new bitmap to continue dirty tracking for the next incremental backup
  - export the "snapshotted" state of disk0 through NBD
  - export the "frozen" bitmap, so that the external tool knows what to copy

Mostly, all steps remain the same; let's go through them:

1. Create an empty temporary image for fleecing -- same as for full backup, no difference.

2. Initialize the fleecing scheme and create a dirty bitmap for the future incremental backup.

qmp: transaction [
    block-dirty-bitmap-add {node: disk0, name: bitmap1, persistent: true}
    block-dirty-bitmap-disable {node: disk0, name: bitmap0}
    blockdev-add {node-name: tmp-protocol, driver: file, filename: temp.qcow2}
    blockdev-add {node-name: tmp, driver: qcow2, file: tmp-protocol}
    blockdev-add {node-name: cbw, driver: copy-before-write, file: disk0, target: tmp, bitmap: {node: disk0, name: bitmap0}}
    blockdev-replace {parent-type: qdev, qdev-id: sda, new-child: cbw}
    blockdev-add {node-name: acc, driver: snapshot-access, file: cbw}
]

qmp: nbd-server-start {...}
qmp: block-export-add {type: nbd, node-name: acc, bitmaps: [{node: disk0, name: bitmap0}]}
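
A more complete sketch of the export command (the export id and name are hypothetical; writable: true again allows NBD_CMD_TRIM, and the {node, name} form of the bitmaps field assumes a QEMU new enough to accept bitmaps from another node):

qmp: block-export-add {id: exp0, type: nbd, node-name: acc, name: fleece, writable: true, bitmaps: [{node: disk0, name: bitmap0}]}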

3. Now a third-party backup tool can read data from the NBD export:

  - The client may negotiate meta contexts, to query the exported dirty bitmap with the NBD_BLOCK_STATUS command (see the example after this list)
  - If the client reads areas that are "not dirty" (according to bitmap0), it gets an error
  - NBD_CMD_TRIM (discard) works as for full backup, no difference
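
As a quick way to inspect the exported bitmap without writing a full NBD client, qemu-img map may be pointed at the bitmap meta context; in its JSON output, ranges with "data": false are the dirty ones (socket path and export name hypothetical, as above):

   qemu-img map --output=json --image-opts \
       'driver=nbd,export=fleece,server.type=unix,server.path=/tmp/fleecing.sock,x-dirty-bitmap=qemu:dirty-bitmap:bitmap0'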

4. After the backup is complete, we should destroy the fleecing scheme:

   - Same as for full backup

5. Next, we should handle dirty bitmaps:

5.1 Failure path

Merge bitmap1 back into bitmap0 and continue tracking in bitmap0:

qmp: transaction [
     block-dirty-bitmap-enable {node: disk0, name: bitmap0}
     block-dirty-bitmap-merge {node: disk0, target: bitmap0, bitmaps: ['bitmap1']}
     block-dirty-bitmap-remove {node: disk0, name: bitmap1}
]

5.2 Success path

We have two possible user scenarios on success:

5.2.1 Continue tracking for next incremental backup in bitmap1

In this case, just remove bitmap0:
qmp: block-dirty-bitmap-remove {node: disk0, name: bitmap0}

Alternatively, you may keep bitmap0 disabled instead of deleting it, to reuse it in the future for differential backups (see below).

5.2.2 Continue tracking for next incremental backup in bitmap0 (assume we always work with one bitmap, do not want any kind of differential backups, and do not associate bitmap names with stored backups)

In this case, enable and clear bitmap0, merge bitmap1 into bitmap0, and remove bitmap1:

qmp: transaction [
     block-dirty-bitmap-enable {node: disk0, name: bitmap0}
     block-dirty-bitmap-clear {node: disk0, name: bitmap0}
     block-dirty-bitmap-merge {node: disk0, target: bitmap0, bitmaps: ['bitmap1']}
     block-dirty-bitmap-remove {node: disk0, name: bitmap1}
]



== Push backup with fleecing full/incremental ==

Reasoning: the main problem with simple push backup is that guest writes may be seriously slowed down by copy-before-write operations when the backup target is slow. To solve this problem, we use a scheme like the one for pull backup: we create a local temporary image as the target for copy-before-write operations, but instead of exporting the "snapshot-access" node we start an internal backup from it to the target.

So the scheme and commands look exactly the same as for full and incremental pull backup. The only difference is that we do not start an NBD export; instead we add the target node to QEMU and start an internal backup. The good thing is that this may be done in the same transaction that initializes the fleecing scheme:

qmp: transaction [
     ... initialize fleecing scheme for full or incremental backup ...

# Add the target node. Here a qcow2 node is added, but it may be an NBD node or something else
     blockdev-add {node-name: target-protocol, driver: file, filename: target.qcow2}
     blockdev-add {node-name: target, driver: qcow2, file: target-protocol}

# Start backup
     blockdev-backup {device: acc, target: target, ...}
]
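
In raw QMP wire format the final action above would look roughly like this (the job id push0 is hypothetical):

     {"type": "blockdev-backup",
      "data": {"job-id": "push0", "device": "acc", "target": "target", "sync": "full"}}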

If it is an incremental backup, also pass the bitmap parameters:

     blockdev-backup {..., bitmap: bitmap0, sync: incremental, bitmap-mode: never}

Note bitmap-mode=never: this means that the backup job does nothing with bitmap0, so we keep the same scheme as for pull backups (handle bitmaps by hand after the backup). Still, the push-backup scheme may be adapted to use other bitmap modes.

What we lack here is discarding data in the 'acc' node after a block is successfully copied to the target, to save disk space and avoid extra copy-before-write operations. That is a TODO; it should be implemented as something like a discard-source parameter for blockdev-backup.



== Differential backups ==

I'm not a fan of this idea, but I think it should be described.

Assume we already have a chain of incremental backups (represented as a qcow2 chain on a backup storage server, for example). They correspond to points in time T0, T1, T2, T3. Assume T3 is the last backup.

A usual incremental backup would be the diff between T3 and the current time (which becomes T4).

A differential backup says: I want a backup covering everything from T1 to the current time. What for? Maybe the T2 and T3 backups were removed or somehow damaged.

How to do that in QEMU: on each incremental backup you start a new bitmap and _keep_ the old one, disabled.
This way we have bitmap0 (the diff between T0 and T1), bitmap1 (diff T1..T2), bitmap2 (diff T2..T3), and bitmap3, which holds the diff from T3 up to the current time. bitmap3 is the only enabled bitmap; the others are disabled.

So, to make a differential backup, use the block-dirty-bitmap-merge command to merge all the bitmaps you need into one, and then use it in any backup scheme.
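
A sketch with the bitmaps from the example above (the temporary bitmap diff-tmp is hypothetical; it is created disabled because it only aggregates frozen history, and may be removed once the backup finishes):

qmp: block-dirty-bitmap-add {node: disk0, name: diff-tmp, disabled: true}
qmp: block-dirty-bitmap-merge {node: disk0, target: diff-tmp, bitmaps: ['bitmap1', 'bitmap2', 'bitmap3']}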

The drawback is that all these disabled bitmaps eat RAM. A possible solution is to not keep them in RAM: it is fine to keep them only in the qcow2 file and load them on demand. That is not implemented yet and is a TODO for those who want differential backups.

-- 
Best regards,
Vladimir


