[PATCH 32/32] kbase: Add document outlining internals of incremental backup in qemu

Sun Jun 21 23:40:18 UTC 2020

On Mon, Jun 15, 2020 at 8:13 PM Peter Krempa <pkrempa at redhat.com> wrote:
>
> Outline the basics and how to integrate with externally created
> overlays. Other topics will continue later.

Thanks, this is very helpful!

> Signed-off-by: Peter Krempa <pkrempa at redhat.com>
> ---
>  docs/kbase.html.in                        |   3 +
>  docs/kbase/incrementalbackupinternals.rst | 210 ++++++++++++++++++++++
>  2 files changed, 213 insertions(+)
>  create mode 100644 docs/kbase/incrementalbackupinternals.rst
>
> diff --git a/docs/kbase.html.in b/docs/kbase.html.in
> index c586e0f676..4257e52b7e 100644
> --- a/docs/kbase.html.in
> +++ b/docs/kbase.html.in
> @@ -36,6 +36,9 @@
>
>          <dt><a href="kbase/virtiofs.html">Virtio-FS</a></dt>
>          <dd>Share a filesystem between the guest and the host</dd>
> +
> +        <dt><a href="kbase/incrementalbackupinternals.html">Incremental backup internals</a></dt>
> +        <dd>Incremental backup implementation details relevant for users</dd>
>        </dl>
>      </div>
>
> diff --git a/docs/kbase/incrementalbackupinternals.rst b/docs/kbase/incrementalbackupinternals.rst
> new file mode 100644
> index 0000000000..adf12002d2
> --- /dev/null
> +++ b/docs/kbase/incrementalbackupinternals.rst
> @@ -0,0 +1,210 @@
> +================================================
> +Internals of incremental backup handling in qemu
> +================================================
> +
> +.. contents::
> +
> +Libvirt's implementation of incremental backups in the ``qemu`` driver uses
> +qemu's ``block-dirty-bitmaps`` under the hood to track the guest visible disk
> +state changes corresponding to the points in time described by a libvirt
> +checkpoint.
> +
> +The way libvirt creates and manages the bitmaps has semantic implications which
> +de facto become API, as the bitmaps are written into the disk images. This
> +document tries to summarize them.
> +
> +Glossary
> +========
> +
> +Checkpoint
> +
> +    A libvirt object which represents a named point in time in the life of the
> +    VM from which libvirt tracks the VM's writes, allowing a later backup of
> +    only the blocks which changed. Note that VM memory state is _not_ captured.
> +
> +    A checkpoint can be created either explicitly via the corresponding API
> +    which isn't very useful or is created as part of creating an
> +    incremental or full backup of the VM using the ``virDomainBackupBegin`` API
> +    which allows a next backup to only copy the differences.
> +
> +Backup
> +
> +    A copy of either all blocks of selected disks (full backup) or blocks changed
> +    since a checkpoint (incremental backup) at the time the backup job was
> +    started. (Blocks modified while the backup job is running are not part of the
> +    backup!)
> +
> +Snapshot
> +
> +    Similarly to a checkpoint it's a point in time in the lifecycle of the VM
> +    but the state of the VM including memory is captured at that point allowing
> +    returning to the state later.
> +
> +Blockjob
> +
> +    A long running job which modifies the shape and/or location of the disk
> +    backing chain (images storing the disk contents). Libvirt supports
> +    ``block pull`` where data is moved up the chain towards the active layer,
> +    ``block commit`` where data is moved down the chain towards the base/oldest
> +    image. These blockjobs always remove images from the backing chain. Lastly
> +    ``block copy`` where image is moved to a different location (and possibly
> +    collapsed moving all of the data into the new location into the one image).
> +
> +block-dirty-bitmap (bitmap)
> +
> +    A data structure in qemu tracking which blocks were written by the guest
> +    OS since the bitmap was created.
> +
> +Relationships of bitmaps, checkpoints and VM disks
> +==================================================
> +
> +When a checkpoint is created libvirt creates a block-dirty-bitmap for every
> +configured VM disk named the same way as the checkpoint. The bitmap is actively
> +recording which blocks were changed by the guest OS from that point on. Other
> +bitmaps are not impacted in any way as they are self-contained:
> +
> +::
> +
> + +----------------+       +----------------+
> + | disk: vda      |       | disk: vdb      |
> + +--------+-------+       +--------+-------+
> +          |                        |
> + +--------v-------+       +--------v-------+
> + | vda-1.qcow2    |       | vdb-1.qcow2    |
> + |                |       |                |
> + | bitmaps: chk-a |       | bitmaps: chk-a |
> + |          chk-b |       |          chk-b |
> + |                |       |                |
> + +----------------+       +----------------+
> +
> +Bitmaps are created at the same time to track changes to all disks in sync and
> +are active and persisted in the QCOW2 image. Other formats currently don't
> +support this feature.
> +
> +Modification of bitmaps outside of libvirt is not recommended, but when adhering
> +to the same semantics which this document describes it should be safe to do
> +so, although we can't guarantee that.
> +
> +
> +Integration with external snapshots
> +===================================
> +
> +Handling of bitmaps
> +-------------------
> +
> +Creating an external snapshot involves adding a new layer to the backing chain
> +on top of the previous chain. In this step there are no new bitmaps created by
> +default, which would mean that backups become impossible after this step.
> +
> +To prevent this from happening we need to re-create the active bitmaps in the
> +new top/active layer of the backing chain which allows us to continue tracking
> +the changes with same granularity as before and also allows libvirt to stitch
> +together all the corresponding bitmaps to do a backup across snapshots.
> +
> +After taking a snapshot of the ``vda`` disk from the example above placed into
> +``vda-2.qcow2`` the following topology will be created:
> +
> +::
> +
> +   +----------------+
> +   | disk: vda      |
> +   +-------+--------+
> +           |
> +   +-------v--------+    +----------------+
> +   | vda-2.qcow2    |    | vda-1.qcow2    |
> +   |                |    |                |
> +   | bitmaps: chk-a +----> bitmaps: chk-a |
> +   |          chk-b |    |          chk-b |
> +   |                |    |                |
> +   +----------------+    +----------------+
> +
> +Checking bitmap health
> +----------------------
> +
> +QEMU optimizes disk writes by only updating the bitmaps in certain cases. This
> +also can cause problems in cases when e.g. QEMU crashes.
> +
> +For a chain of corresponding bitmaps in a backing chain to be considered valid
> +and eligible for use with ``virDomainBackupBegin`` it must conform to the
> +following rules:
> +
> +1) Top image must contain the bitmap
> +2) If any of the backing images in the chain contain the bitmap too all
> +   contiguous images must have the bitmap (no gaps)
> +3) all of the above bitmaps must be marked as active
> +   (``auto`` flag in ``qemu-img`` output, ``recording`` in qemu)
> +4) none of the above bitmaps can be inconsistent
> +   (``in-use`` flag in ``qemu-img`` provided that it's not used on image which
> +   is currently in use by a qemu instance, or ``inconsistent`` in qemu)

Can you add a chapter about the old format and how it was different from the
new format?

Looks like the differences are:
- all bitmaps are always active, so no need to enable or disable them
- based on the next section, new snapshots contain all the bitmaps from the
  previous snapshots (since all of them are active and we copy all of them to
  the new snapshot)

How does qemu know which bitmap should track changes if all bitmaps are active?

When libvirt starts a VM, it knows nothing about the checkpoints. We define the
checkpoints right before the first backup after starting a VM. So both libvirt
and qemu know nothing about the bitmaps at this point.

Do you expect to have the checkpoints defined before the guest is started?

If I understand this correctly, if we do:

- create base image
- do full backup (check-1)
- do incremental backup 1 (check-2)
- create snapshot-1
- do incremental backup 2 (check-3)
- do incremental backup 3 (check-4)
- create snapshot-2
- do incremental backup 4 (check-5)

This will be the image structure:

- base image
    - check-1
    - check-2
- snapshot-1
    - check-1
    - check-2
    - check-3
    - check-4
- snapshot-2
    - check-1
    - check-2
    - check-3
    - check-4
    - check-5

So we are duplicating bitmaps that have no content in every snapshot?

Why not copy only the last (current) bitmap?

- base image
    - check-1
    - check-2
- snapshot-1
    - check-2
    - check-3
    - check-4
- snapshot-2
    - check-4
    - check-5

> +::
> +
> + # check that image has bitmaps
> +  $ qemu-img info vda-1.qcow2
> +   image: vda-1.qcow2
> +   file format: qcow2
> +   virtual size: 100 MiB (104857600 bytes)
> +   disk size: 220 KiB
> +   cluster_size: 65536
> +   Format specific information:
> +       compat: 1.1
> +       compression type: zlib
> +       lazy refcounts: false
> +       bitmaps:
> +           [0]:
> +               flags:
> +                   [0]: in-use
> +                   [1]: auto
> +               name: chk-a
> +               granularity: 65536
> +           [1]:
> +               flags:
> +                   [0]: auto
> +               name: chk-b
> +               granularity: 65536
> +       refcount bits: 16
> +       corrupt: false
> +
> +(See also the ``qemuBlockBitmapChainIsValid`` helper method in
> +``src/qemu/qemu_block.c``)

Looks like oVirt needs to implement this as well, otherwise we will waste time
creating bitmaps on a snapshot when libvirt will fail the backup later since
there was an inconsistent or disabled bitmap.
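
For reference, a rough sketch of how such a check could look on our side,
following the four rules quoted above. The function name and the way the
per-image flags are passed in are my own assumptions, this is not what
qemuBlockBitmapChainIsValid does internally:

```shell
#!/usr/bin/env bash
# bitmap_chain_is_valid FLAGS_TOP [FLAGS_BACKING ...]   (hypothetical helper)
#
# Each argument describes one image of the backing chain, top image first:
#   "-"            the image does not contain the bitmap
#   "auto"         the bitmap is present with this flag (qemu-img info naming)
#   "in-use,auto"  several flags, comma separated
#   ""             the bitmap is present but has no flags (i.e. it is disabled)
bitmap_chain_is_valid() {
    local idx=0 seen_gap=0 flags

    for flags in "$@"; do
        if [ "$flags" = "-" ]; then
            # rule 1: the top image must contain the bitmap
            if [ "$idx" -eq 0 ]; then
                return 1
            fi
            seen_gap=1
        else
            # rule 2: no gaps, the bitmap must not reappear after a missing layer
            if [ "$seen_gap" -eq 1 ]; then
                return 1
            fi
            # rule 3: the bitmap must be active ("auto")
            case ",$flags," in
                *,auto,*) ;;
                *) return 1 ;;
            esac
            # rule 4: the bitmap must not be inconsistent ("in-use")
            case ",$flags," in
                *,in-use,*) return 1 ;;
            esac
        fi
        idx=$((idx + 1))
    done

    # an empty chain has no top image, so it cannot be valid
    [ "$idx" -gt 0 ]
}
```

For example, `bitmap_chain_is_valid auto auto -` succeeds (bitmap in the two
upper images, not in the base), while `bitmap_chain_is_valid auto - auto` fails
because of the gap. As the patch notes, "in-use" is only meaningful when the
image is not opened by a running qemu.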

> +Creating external checkpoints manually
> +--------------------------------------
> +
> +To create the same topology outside of libvirt (e.g. when doing snapshots offline)
> +a new ``qemu-img`` which supports the ``bitmap`` subcommand is necessary. The
> +following algorithm then ensures that the new image after snapshot will work
> +with backups (note that ``jq`` is a JSON processor):
> +
> +::
> +
> +  # arguments
> +  SNAP_IMG="vda-2.qcow2"
> +  BACKING_IMG="vda-1.qcow2"
> +
> +  # constants - snapshots and bitmaps work only with qcow2
> +  SNAP_FMT="qcow2"
> +  BACKING_IMG_FMT="qcow2"
> +
> +  # create snapshot overlay
> +  qemu-img create -f "$SNAP_FMT" -F "$BACKING_IMG_FMT" -b "$BACKING_IMG" "$SNAP_IMG"
> +
> +  BACKING_IMG_INFO=$(qemu-img info --output=json -f "$BACKING_IMG_FMT" "$BACKING_IMG")
> +  BACKING_BITMAPS=$(jq '."format-specific".data.bitmaps' <<< "$BACKING_IMG_INFO")
> +
> +  if [ "x$BACKING_BITMAPS" == "xnull" ]; then
> +      exit 0
> +  fi
> +
> +  for BACKING_BITMAP_ in $(jq -c '.[]' <<< "$BACKING_BITMAPS"); do
> +      BITMAP_FLAGS=$(jq -c -r '.flags[]' <<< "$BACKING_BITMAP_")
> +      BITMAP_NAME=$(jq -r '.name' <<< "$BACKING_BITMAP_")
> +
> +      if grep 'in-use' <<< "$BITMAP_FLAGS" ||
> +         grep -v 'auto' <<< "$BITMAP_FLAGS"; then
> +         continue
> +      fi
> +
> +      qemu-img bitmap -f "$SNAP_FMT" "$SNAP_IMG" --add "$BITMAP_NAME"

So what we have to do is:

- get a list of bitmaps that should be in this disk from oVirt engine
- get a list of bitmaps with the "auto" flag and without the "in-use" flag in
  the backing file
- if the lists do not match we can delete all bitmaps and relevant checkpoints
  on the engine side, since the next backup will fail anyway
- if the lists match, maybe verify that bitmaps are not missing in lower layers
  (gaps)
- if the lists match, create an empty bitmap with the same name and granularity
  in the top image
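
To make the comparison step concrete, here is a minimal sketch. Both function
names are made up, and it assumes the bitmap info was already flattened to one
"NAME FLAGS" line per bitmap (for example with jq from the
`qemu-img info --output=json` output) and that bitmap names contain no
whitespace:

```shell
#!/usr/bin/env bash
# usable_bitmaps (hypothetical helper)
# Reads "NAME FLAGS" lines on stdin, FLAGS being a comma separated list which
# may be empty, and prints the names of bitmaps that have the "auto" flag and
# do not have the "in-use" flag.
usable_bitmaps() {
    local name flags

    while read -r name flags; do
        case ",$flags," in
            *,in-use,*) continue ;;              # inconsistent, skip
            *,auto,*) printf '%s\n' "$name" ;;   # active and consistent
        esac
    done
}

# bitmap_lists_match EXPECTED USABLE (hypothetical helper)
# Both arguments are newline separated lists of bitmap names; the order of the
# names does not matter.
bitmap_lists_match() {
    [ "$(printf '%s\n' "$1" | sort)" = "$(printf '%s\n' "$2" | sort)" ]
}
```

Usage would be something like
`bitmap_lists_match "$ENGINE_BITMAPS" "$(usable_bitmaps < flags.txt)"`, where
`ENGINE_BITMAPS` and `flags.txt` stand for whatever engine reports and whatever
we extracted from the backing file.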

What do we have to do for the old format? Or do we just not implement this
until we get the new format?

Can you add a section explaining how bitmaps should be handled in blockCommit?
We may need to implement this for cold merge using qemu-img commit.

Looking at the current structure, it looks like we have to do:

1. commit top layer to base layer (we support only one layer commit)
2. merge all bitmaps from top layer to base layer
3. copy all bitmaps in top layer that are not in base layer to base layer
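
Spelled out as commands, my guess at these three steps would look roughly like
the sketch below. It only prints the `qemu-img` invocations instead of running
them. The function name is made up, the ordering is unverified, and passing
`-d` to `qemu-img commit` (so the top image is not emptied before its bitmaps
are merged) is my assumption, so please correct it if libvirt does this
differently:

```shell
#!/usr/bin/env bash
# plan_cold_merge TOP BASE TOP_BITMAPS BASE_BITMAPS   (hypothetical helper)
#
# TOP_BITMAPS and BASE_BITMAPS are space separated lists of bitmap names.
# Prints the commands for a one-layer cold merge, it does not execute them.
plan_cold_merge() {
    local top=$1 base=$2 top_bitmaps=$3 base_bitmaps=$4 name

    # step 1: commit the data of the top layer into the base layer
    printf 'qemu-img commit -d -f qcow2 -b %s %s\n' "$base" "$top"

    for name in $top_bitmaps; do
        case " $base_bitmaps " in
            *" $name "*) ;;
            *)
                # step 3: the bitmap exists only in top, create it in base first
                printf 'qemu-img bitmap --add -f qcow2 %s %s\n' "$base" "$name"
                ;;
        esac
        # step 2: merge the bitmap from the top layer into the base layer
        printf 'qemu-img bitmap --merge %s -b %s -F qcow2 -f qcow2 %s %s\n' \
            "$name" "$top" "$base" "$name"
    done
}
```

For example `plan_cold_merge vda-2.qcow2 vda-1.qcow2 "chk-a chk-c" "chk-a chk-b"`
prints the commit command, a merge for chk-a, and an add followed by a merge
for chk-c. Note that the added bitmap gets the default granularity here.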

Nir

> +
> +  done
> --
> 2.26.2
>