[libvirt] QEMU bitmap backup usability FAQ

Vladimir Sementsov-Ogievskiy vsementsov at virtuozzo.com
Wed Sep 25 15:11:30 UTC 2019


25.09.2019 16:52, John Snow wrote:
> 
> 
> On 8/20/19 6:25 PM, John Snow wrote:
>> Hi; downstream at Red Hat, I've been fielding some questions about
>> the usability and feature readiness of bitmaps (and related features) in
>> QEMU.
>>
>> Here are some questions I answered internally that I am copying to the
>> list for two reasons:
>>
>> (1) To make sure my answers are actually correct, and
>> (2) To share this pseudo-reference with the community at large.
>>
>> This is long, and mostly for reference. There's a summary at the bottom
>> with some todo items and observations about the usability of the feature
>> as it exists in QEMU.
>>
>> Before too long, I intend to send a more summarized "roadmap" mail which
>> details all of the current and remaining work to be done in and around
>> the bitmaps feature in QEMU.
>>
>>
>> Questions:
>>
>>> "What format(s) is/are required for this functionality?"
>>
>> From the QEMU API, any format can be used to create and author
>> incremental backups. The only known format limitations are:
>>
>> 1. Persistent bitmaps cannot be created on any format except qcow2,
>> although there are hooks to add support to other formats at a later date
>> if desired.
>>
>> DANGER CAVEAT #1: Adding bitmaps to QEMU by default creates transient
>> bitmaps instead of persistent ones.
>>
>> Possible TODO: Allow users to 'upgrade' transient bitmaps to persistent
>> ones in case they made a mistake.
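>>
>> For illustration, an explicit request for a persistent bitmap looks
>> roughly like this in QMP (the node name 'drive0' and the bitmap name
>> are placeholders; this only succeeds on formats that support
>> persistence, i.e. qcow2 today):
>>
>> { "execute": "block-dirty-bitmap-add",
>>   "arguments": {
>>       "node": "drive0",
>>       "name": "bitmap0",
>>       "persistent": true
>>   }
>> }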
>>
>>
>> 2. When using push backups (blockdev-backup, drive-backup), you may use
>> any format as a target format.
>>
>> DANGER CAVEAT #2: without a backing file and/or sparse-allocation
>> support that does not depend on the filesystem, these target images
>> will not be usable on their own.
>>
>> EXAMPLE: Backing up to a raw file loses allocation information, so we
>> can no longer distinguish between zeroes and unallocated regions. The
>> cluster size is also lost. This file will not be usable without
>> additional metadata recorded elsewhere.*
>>
>> (* This is complicated, but it is in theory possible to do a push backup
>> to e.g. an NBD target with custom server code that saves allocation
>> information to a metadata file, which would allow you to reconstruct
>> backups. For instance, recording in a .json file which extents were
>> written out would allow you to -- with a custom binary -- write this
>> information on top of a base file to reconstruct a backup.)
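>>
>> For reference, a sketch of a bitmap-based push backup to a qcow2
>> target (device, bitmap, and path names are placeholders; this assumes
>> the target file was pre-created with a backing file, e.g. via
>> qemu-img create -b ..., so the result stays usable per caveat #2):
>>
>> { "execute": "drive-backup",
>>   "arguments": {
>>       "device": "drive0",
>>       "target": "/backups/inc.0.qcow2",
>>       "format": "qcow2",
>>       "sync": "incremental",
>>       "bitmap": "bitmap0",
>>       "mode": "existing"
>>   }
>> }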
>>
>>
>> 3. Any format can be used for either shared storage or live storage
>> migrations. There are TWO distinct mechanisms for migrating bitmaps:
>>
>> A) The bitmap is flushed to storage and re-opened on the destination.
>> This is only supported for qcow2 and shared-storage migrations.
>>
>> B) The bitmap is live-migrated to the destination. This is supported for
>> any format and can be used for either shared storage or live storage
>> migrations.
>>
>> DANGER CAVEAT #3: The second bitmap migration technique (B) is gated
>> behind an optional migration capability that must be enabled explicitly.
>> Otherwise, some migration combinations may silently drop bitmaps.
>>
>> Matrix:
>>
>>> migrate = migrate_capability or (persistent and shared_storage)
>>
>> Enumerated:
>>
>> live-storage + raw : transient + no-capability: Dropped
>> live-storage + raw : transient + bm-capability: Migrated
>> live-storage + qcow2 : transient + no-capability: Dropped
>> live-storage + qcow2 : transient + bm-capability: Migrated
>> live-storage + qcow2 : persistent + no-capability: Dropped (!)
>> live-storage + qcow2 : persistent + bm-capability: Migrated
>>
>> shared-storage + raw : transient + no-capability: Dropped
>> shared-storage + raw : transient + bm-capability: Migrated
>> shared-storage + qcow2 : transient + no-capability: Dropped
>> shared-storage + qcow2 : transient + bm-capability: Migrated
>> shared-storage + qcow2 : persistent + no-capability: Migrated
>> shared-storage + qcow2 : persistent + bm-capability: Migrated
>>
>> Enabling the bitmap migration capability will ALWAYS migrate the bitmap.
>> If it's disabled, we will only migrate the bitmaps for shared storage
>> migrations where the bitmap is persistent, which is a qcow2-only case.
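>>
>> For clarity, enabling that capability means issuing something like the
>> following on BOTH the source and destination QEMU before migrating:
>>
>> { "execute": "migrate-set-capabilities",
>>   "arguments": {
>>       "capabilities": [
>>           { "capability": "dirty-bitmaps", "state": true }
>>       ]
>>   }
>> }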
>>
>> There is no warning or error if you attempt to migrate in a manner that
>> loses your bitmaps.
>>
>> (I might be persuaded to add a warning for the case where you are doing
>> a live storage migration of qcow2 with persistent bitmaps, which is
>> somewhat of a conflicting case: you've asked for the bitmap to be
>> persistent, but it seems likely that if this ever happens in practice,
>> it's because you have neglected to ask for it to be migrated to the new
>> host.)
>>
>> See iotest 169 for more details on this.
>>
>> At present, these are the only format limitations I am consciously aware
>> of. From a management API/GUI perspective, it makes sense to restrict
>> the feature set to "qcow2 only" to minimize edge cases.
>>
>>
>>> "Is libvirt aware of these 'gotcha' cases?"
>>
>> From talks I've had with Eric Blake and Peter Krempa, they certainly are
>> now.
>>
>>
>>> "Is it possible to make persistent the default?"
>>
>> Not quickly.
>>
>> In QEMU, not without a deprecation period or some other incompatibility.
>> Default values are not (yet?) introspectable via the schema. We need
>> (possibly) up to two QAPI extensions:
>>
>> I) The ability to return deprecation warnings when issuing a command
>> that will cease to work in the future.
>>
>> This has been discussed somewhat on-list recently. It seems like
>> there is not a big appetite for tackling something perceived as
>> low-value, because such warnings are likely to be ignored.
>>
>> II) The ability to document default values in the QAPI schema for the
>> purposes of introspection.
>>
>> With one or both of these extensions, we could remove the default value
>> for persistence and promote it to a required argument with a
>> transitional period where it will work with a warning. Then, in the
>> future, users will be forced to specify if they want persistent=true or
>> persistent=false.
>>
>> This is not on my personal roadmap to implement.
>>
>>
>>> "Is it possible to make bitmap migration the default?"
>>
>> I don't know at present. Migration capabilities are either "on" or "off"
>> and the existing negotiation mechanisms for capabilities are extremely
>> rudimentary.
>>
>> Changing this might require fiddling with machine compat properties,
>> adding features to the migration protocol, or more. I don't know what I
>> don't know, so I will estimate this change as likely invasive.
>>
>> I've discussed this with David Gilbert and it seems like a complicated
>> project for the benefit of this sub-project alone, so this isn't on my
>> personal roadmap to resolve.
>>
>> The general consensus appears to be that protecting the user is
>> libvirt's job.
>>
>>
>>> "Where do we stand with external snapshot support?"
>>
>> Still broken. In the aftermath of 4.1, it's the most obvious outstanding
>> broken feature. Vladimir has patches to fix it, but they need some
>> attention.
>>
> 
> It looks as if the fix is a little risky, but the correct fix is
> going to be much harder. Our reopen support simply does not accommodate
> images needing to write dirty bits on open in a hierarchical graph.

I tried the hard way; you can look through the previous versions of the series.
Kevin disliked it.

> 
>>
>>> "What needs to happen to bitmaps when doing stream or commit?"
>>
>> Uncertain in QEMU; creating an external snapshot implicitly ends the
>> timeslice represented by the old bitmap, but an explicit checkpoint is
>> better.
>>
>> I think some little ascii charts will help people understand what we're
>> talking about here, so let's cover some examples.
>>
>>
>> SCENARIO 1)
>>
>> Here's a timeline for a single node (one image, no backing files), with
>> some points in time highlighted:
>>
>> Time T = 0.........................n
>> +rec:    [--X------Y------Z--------]
>> -rec:    [---------x------y--------]
>> region:  [aabbbbbbbcccccccddddddddd]
>>
>>
>> X, Y, and Z are points in time where bitmaps 'x', 'y', and 'z' were
>> created and began recording. x and y are points in time where bitmaps
>> 'x' and 'y' stopped recording.
>>
>> This creates a few distinct regions / timeslices.
>>
>> a: Data written before we began tracking writes.
>> b: Data written to bitmap 'x'
>> c: Data written to bitmap 'y'
>> d: Data written to bitmap 'z'
>>
>> region 'a' is of an unknown size and indeterminate length, because there
>> is no reference point (checkpoint) prior to it.
>>
>> regions 'b' and 'c' are of finite size and determinate length, because
>> they have fixed reference points on either side of their timeslice.
>>
>> region 'd' is also of an unknown size and indeterminate length, because
>> it is actively recording and has no checkpoint to its right. It may be
>> fixed at any time by disabling bitmap 'z'.
>>
>> In QEMU, generally we want to do several things at one atomic moment
>> to keep these regions adjacent, contiguous, and disjoint. So, from a
>> high level (using a fictional simplified syntax), we do:
>>
>> Transaction(
>>      create('y'),
>>      disable('x'),
>>      backup('x')
>> )
>>
>> which together performs a backup+checkpoint.
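>>
>> (In real QMP, that fictional transaction corresponds to something like
>> the following sketch; node, device, and target names are placeholders,
>> and it assumes a QEMU new enough to permit these bitmap actions inside
>> 'transaction'. Note that sync=incremental clears bitmap 'x' on success.)
>>
>> { "execute": "transaction",
>>   "arguments": {
>>       "actions": [
>>           { "type": "block-dirty-bitmap-add",
>>             "data": { "node": "drive0", "name": "y" } },
>>           { "type": "block-dirty-bitmap-disable",
>>             "data": { "node": "drive0", "name": "x" } },
>>           { "type": "drive-backup",
>>             "data": { "device": "drive0",
>>                       "target": "/backups/inc.1.qcow2",
>>                       "format": "qcow2",
>>                       "sync": "incremental",
>>                       "bitmap": "x" } }
>>       ]
>>   }
>> }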
>>
>> We can do a backup without a checkpoint:
>>
>> 4.1:
>> Transaction(
>>      create('tmp')
>>      merge('tmp', 'x')
>>      backup('tmp')
>> )
>>
>> 4.2:
>>> backup('x', bitmap_sync=never)
>>
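>> (In concrete QMP terms, I expect the 4.2 form to look roughly like
>> this, using the new 'bitmap' sync mode and 'bitmap-mode' parameter;
>> names and paths are placeholders:)
>>
>> { "execute": "drive-backup",
>>   "arguments": {
>>       "device": "drive0",
>>       "target": "/backups/diff.0.qcow2",
>>       "format": "qcow2",
>>       "sync": "bitmap",
>>       "bitmap": "x",
>>       "bitmap-mode": "never"
>>   }
>> }
>>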
>> Or a checkpoint without a backup:
>>
>> Transaction(
>>      create('y'),
>>      disable('x')
>> )
>>
> 
> Concerning the following scenario:
> 
>>
>> SCENARIO 2)
>>
>> Now, what happens when we make an external snapshot and do nothing at
>> all to our bitmaps?
>>
>> Time T = 0.......................................n
>> +rec:    [--X------Y------Z--------] <-- [-------]
>> -rec:    [---------x------y--------] <-- [-------]
>> region:  [aabbbbbbbcccccccddddddddd] <-- [eeeeeee]
>>           {          base           } <-- {  top  }
>>
>> We've created a new implicit timeslice, 'e', without creating a new
>> bitmap. Because the bitmap 'z' was still active at the time of the
>> snapshot, it now has a temporarily-determinate endpoint to its region.
>>
>> This is kind of like an "implied checkpoint", but it's a very poor one
>> because it's not really addressable.
>>
>> DANGER CAVEAT #4: We have no way to create incremental backups anymore,
>> because the current moment in time has no addressable point.
>>
>> That's not great, but it is likely a fixable scenario once commit is
>> fixed: committing the top layer back down into the base layer will add
>> all new writes to the end of the old region, restoring our backup chain:
>>
>> Time T = 0.........................C.......n
>> +rec:    [--X------Y------Z-------- -------]
>> -rec:    [---------x------y-------- -------]
>> region:  [aabbbbbbbcccccccddddddddd ddddddd]
>>
>> Here, region 'e' just gets appended to region 'd', and we can make
>> incremental backups from any checkpoint X, Y, Z to the current moment again.
>>
> 
> It's been brought to my attention that oVirt wants to be able to create
> snapshots offline.
> 
> It's not clear if they are willing to make these snapshots using
> libvirt's offline support, or if they want to do it using qemu-img directly.
> 
> If using libvirt, libvirt will be able to manage bitmaps as it sees fit,
> even offline, using qemu and QMP to manage the images (offline).
> 
> If it's the latter, this snapshot scenario is the one they will
> encounter, where we have a top layer that has no inherent checkpoint or
> bitmap information.
> 
> Ramifications of this were discussed below in the original email:
> [scroll ...]
> 
>>
>> SCENARIO 3)
>>
>> What happens if we make a firm checkpoint at the same time we make the
>> snapshot?
>>
>> Transaction(
>>      disable('z'),
>>      snapshot('top'),
>>      create('w')
>> )
>>
>> Time T = 0.........................         ......n
>> +rec:    [--X------Y------Z-------- ] <-- [W------]
>> -rec:    [---------x------y--------z] <-- [-------]
>> region:  [aabbbbbbbcccccccddddddddd ] <-- [eeeeeee]
>>           {          base            } <-- {  top  }
>>
>> Now instead of the new region 'e' being implied, it's explicit. We can
>> make backups between any point and the current moment *across* the gap.
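>>
>> (A sketch of that transaction in real QMP, assuming the overlay image
>> was created and attached beforehand with blockdev-add as node 'top',
>> and that bitmap actions are permitted inside 'transaction'; all node
>> and bitmap names here are placeholders:)
>>
>> { "execute": "transaction",
>>   "arguments": {
>>       "actions": [
>>           { "type": "block-dirty-bitmap-disable",
>>             "data": { "node": "base", "name": "z" } },
>>           { "type": "blockdev-snapshot",
>>             "data": { "node": "base", "overlay": "top" } },
>>           { "type": "block-dirty-bitmap-add",
>>             "data": { "node": "top", "name": "w" } }
>>       ]
>>   }
>> }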
>>
>> It was my thought that this was the preferable method for libvirt to
>> use, but there is some doubt from Peter Krempa. We'll see how it
>> shakes out.
>>
>>
>>
>> There are questions about what QEMU should do by default, without
>> libvirt's help. At the moment, it's "nothing" but there have been
>> questions about "something".
>>
>> Keeping in mind that we likely can't change our existing behavior
>> without some kind of flag, there are still some API/usability questions:
>>
>>
>>> If we create an external snapshot on top of an image with actively
>>> recording bitmaps, should we disable them?
>>
>> We can leave them enabled, but they'll never see any writes. Or we can
>> explicitly disable them. Explicitly disabling them may make more sense
>> to prevent modifying bitmaps accidentally on commit.
>>
>> My guess: No; we should leave them alone and allow checkpoint creation
>> mechanisms to do the disable+create dance for bitmaps as needed.
>>
>> Potential problems: The backing image is read-only, and if we change our
>> mind later, we will need to find a way to re-open the backing image as
>> read-write for the purposes of toggling the recording bit prior to any
>> legitimate guest usage of that node. Then, re-open as RO again.
>>
>>
>>
>>> Should we fork bitmaps (on snapshot)?
>>
>> If a bitmap named 'z' is recording when we create an external snapshot,
>> should that bitmap be *copied* into the top layer?
>>
>> My guess: No.
>>
>> This would allow us to create external snapshots *without* creating a
>> checkpoint, but conceptually that's a nightmare: It would allow for
>> mutually independent creation of snapshots OR checkpoints. This would be
>> hard to corral when undoing a snapshot, for instance.
>>
>> In my opinion, snapshots MUST be checkpoints, and therefore allowing a
>> snapshot without creating a checkpoint is a no-go.
>>
>>
>>> (Should we fork bitmaps) if we're not using checkpoints?
>>
>> If we are using a checkpoint-less paradigm (i.e. the rolling incremental
>> backup using only one bitmap) we might want to copy the bitmap up to
>> make the next incremental backup as if nothing ever happened.
>>
>> However, rolling incremental backups don't need any kind of auto-copy
>> feature. This is possible today:
>>
>>> create('base', 'A')
>>> transact(snapshot('top'), create('top', 'B'))
>>> merge('B', [('base', 'A'), ('top', 'B')])
>>
>> i.e., we create a new bitmap on the top layer, then merge in the old
>> data from the backing file, which remains addressable.
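>>
>> (The merge step relies on block-dirty-bitmap-merge accepting sources
>> from other nodes; a sketch, with 'base'/'top' as placeholder node
>> names:)
>>
>> { "execute": "block-dirty-bitmap-merge",
>>   "arguments": {
>>       "node": "top",
>>       "target": "B",
>>       "bitmaps": [ { "node": "base", "name": "A" } ]
>>   }
>> }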
>>
>> Whether the user wants to copy up or not, there are commands that will
>> do that already.
>>
>>
> 
> ... the following section covers some ways of avoiding the problems of
> the scenario I replied to above, but mostly in the context of what QEMU
> can do to prevent the scenario -- to which the conclusion was "nothing,"
> especially if snapshots are created without QEMU's facilitation (via
> qemu-img).
> 
>>> Should we create new bitmaps by default when we can?
>>
>> If a backing image has bitmaps, should QEMU automatically create a new
>> bitmap for the top layer? Should it be named something new, something
>> user-provided, or based on existing active bitmaps?
>>
>> If a user creates a new external snapshot with no consideration paid to
>> bitmaps (like "SCENARIO 2" above), they temporarily lose the ability to
>> do incremental backups. They might be able to commit the image back to
>> "try again."
>>
>> That's not great. Here are some options for resolving this:
>>
>> - Automatic names: Might cause collisions out-of-band with management
>> tooling by accident, tooling has to query to discover what bitmaps were
>> automatically created.
>>
>> - Same names: Can create namespace confusion when committing snapshots
>> later; although each layer of a backing chain can have bitmaps named the
>> same thing, it causes future problems when committing together that can
>> be hard to resolve.
>>
>> - User-provided name: This is workable, and requires an amendment to the
>> snapshot command to automatically create a new bitmap on the snapshot.
>>
>>
>> My guess: No, we can't automatically create a new bitmap for the user.
>> We can amend the snapshot commands to accept bitmap names, but at that
>> point we've just duplicated transactions:
>>
>> Transact(
>>      snapshot('top'),
>>      create('top', 'new-bitmap')
>> )
>>
> 
> There's one last relevant mitigation discussed further down: [scroll ...]
> 
>>
>> All that said (mostly a lot of "No, let's not do anything"), maybe there's
>> room for an "assistive" mode for users, a bitmap-aware snapshot creation
>> command. It could do the following well-defined magic:
>>
>> bitmap-snapshot(base, top, bitmap_name):
>>      1. disable any active bitmaps in the base node.
>>      2. create a bitmap named "bitmap_name" in the top node, failing if
>>         a bitmap by that name already exists on either node.
>>
>> What this accomplishes:
>> - Disables any bitmaps in the base layer ahead of time, in preparation
>> for an eventual commit operation.
>> - Always creates a new, enabled bitmap on the snapshot node which is
>> guaranteed not to conflict with a name on the base node. This bitmap can
>> be used to create additional copies post-hoc, if desired.
>> - Formalizes our "best practice" suggestion for mixing bitmaps and
>> snapshots into a single, documented command.
>>
>> Is this strictly needed? No, if you have the foresight, you can do this
>> instead:
>>
>> Transact(
>>      disable('a'),
>>      disable('b'),
>>      disable('c'),
>>      # plus however many more ...
>>      snapshot('top', ...),
>>      create('top', 'd')
>> )
>>
>> but a convenience command might still have a role to play in helping
>> take the guesswork out for non-libvirt users.
>>
>>
>>
>> That's the bulk of what was discussed.
>>
>> Summary:
>>
>>
>> GOTCHAs:
>> #1: Bitmaps are created non-persistent by default, and their
>> persistence can't be changed after creation.
>>
>> #2: Push backup destination formats will happily back up to a format
>> that isn't semantically useful.
>>
>> #3: Migrating non-shared block storage can drop even persistent bitmaps
>> if you don't pass the bitmap migration capability flag to both QEMU
>> instances.
>>
>> #4: Creating a snapshot without doing some bitmap manipulation
>> beforehand can temporarily render your bitmaps unusable. Failing to
>> disable bitmaps before creating a snapshot might make commits very
>> tricky later on.
>>
>> Gotchas 1 and 4 can be at least partially alleviated. Gotcha 2 remains a
>> pain point we cannot intercept at the QEMU layer. Gotcha 3 has potential
>> remedies, but they are complicated.
>>
>>
>> QEMU todo items:
>> - Fix bitmap data corruption on commit (Ongoing, by Vladimir at Virtuozzo)
>>
>> - add a set_persistence method for bitmaps that allows us to change the
>> storage class of a bitmap after creation. (Helps alleviate gotcha #1.)
>>
>> - Add a command that allows us to merge allocation data into a bitmap.
>> This helps alleviate gotcha #4: If we create a new image but neglected
>> to do the proper transaction dance, we can simply copy the allocation
>> data into a new bitmap. (Note, we'd still need set_persistence to help
>> us disable the old bitmap before any commit happens.)
>>
> 
> ... This was perceived at the time to be an unnecessary convenience
> feature, because the belief was that libvirt should simply avoid this
> from happening in the first place.
> 
> However, if we acknowledge that snapshots may be made without libvirt's
> help, this is a quick and easy way to "fix" checkpoint consistency post-hoc.

Still, even without libvirt, a management tool should prevent this from happening
in the first place. Or are we talking about an end user running qemu-img by hand,
without any management layer?

And I'm still sure that qemu-img is the wrong instrument here; it's better to use
QEMU itself in a stopped state for offline manipulations.

But I'm not opposed to the idea; it should work, of course.


> 
> --js
> 
>> - Add convenience command for easy + safe combination of bitmaps +
>> snapshots. Helps prevent #4.
>>
>>
>> Research items:
>> - How hard is it to reopen a backing image as RW while it's in use,
>> disable a bitmap, and then reopen it as RO? This is to partially address
>> gotcha #4, for when we forget to disable bitmaps before creating the snapshot.
>>
>> - How hard is the reverse operation? Can we reopen a backing image RW,
>> enable a bitmap, and then reopen as RO? This gives us better control
>> over what happens on commit.
>>
>> - After we fix the commit bug, what does/should commit actually do with
>> bitmaps? What about bitmaps that collide? The current behavior is that
>> bitmaps do not transfer from top to base at all; any bitmaps active in
>> the base record all of the new writes from the top.
>>
>>
>> That's all!
>> --js
>>


-- 
Best regards,
Vladimir



