[libvirt] [RFC v2] external (pull) backup API

John Snow jsnow at redhat.com
Thu Apr 12 23:53:17 UTC 2018



On 04/12/2018 08:57 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 12.04.2018 07:14, John Snow wrote:
>>
>>
>> On 04/11/2018 12:32 PM, Eric Blake wrote:
>>> On 04/03/2018 07:01 AM, Nikolay Shirokovskiy wrote:
>>>> Hi, all.                                                                                         
>>>>                                                                                                  
>>>> This is another RFC on the pull backup API. This API provides the means to read domain
>>>> disks in a snapshotted state so that a client can back them up, as well as the means
>>>> to write domain disks to revert them to a backed-up state. The previous version
>>>> of the RFC is [1]. I'll also describe the API implementation details to shed light
>>>> on the usage of misc qemu dirty bitmap commands.
>>>
> 
> 
> 
> [snip]
> 
> 
>>>>
>>>> Qemu can track which disk blocks have changed since the snapshotted state, so on the next
>>>> backup the client can back up only the changed blocks. virDomainBlockSnapshotCreateXML
>>>> accepts the VIR_DOMAIN_BLOCK_SNAPSHOT_CREATE_CHECKPOINT flag to turn this on
>>>> for a snapshot, which means tracking changes from this particular snapshot. I used
>>>> the term checkpoint and not [dirty] bitmap because in the current implementation many
>>>> qemu dirty bitmaps are used to provide the changed blocks from a given checkpoint to
>>>> the current snapshot (see the *Implementation* section for more details). Also,
>>>> a bitmap accumulates block changes and thus itself changes over time, while a checkpoint
>>>> is a more static term: it means you can query changes from that moment in time.
>>>>
>>>> Checkpoints are visible in active domain xml:
>>>>
>>>>     <disk type='file' device='disk'>
>>>>       ..
>>>>       <target dev='sda' bus='scsi'/>
>>>>       <alias name='scsi0-0-0-0'/>
>>>>       <checkpoint name="93a5c045-6457-2c09-e56c-927cdf34e178"/>
>>>>       <checkpoint name="5768a388-c1c4-414c-ac4e-eab216ba7c0c"/>
>>>>       ..
>>>>     </disk>
>>>>
>>
>> It makes sense to avoid the bitmap name in libvirt, but do these indeed
>> correlate 1:1 with bitmaps?
>>
>> I assume each bitmap will have name=%%UUID%% ?
> 
> There is a 1:1 correlation but the names are different. Check out the checkpoints subsection
> of the *implementation details* section below for the naming scheme.
> 

Yeah, I saw later. You have both "checkpoints" (associated with bitmaps)
and then the bitmaps themselves.

>>
>>>> Every checkpoint requires a qemu dirty bitmap, which eats 16MiB of RAM for a 1TiB disk
>>>> with the default dirty block size of 64KiB, and the same amount of disk space is used.
>>>> So the client needs to manage checkpoints and delete unused ones. Thus the next API function:
>>>>
>>>>
> 
> 
> 
> [snip]
> 
> 
> 
>>>> First, a few facts about qemu dirty bitmaps.
>>>>
>>>> A bitmap can be either in an active or a disabled state. In the disabled state it is not
>>>> changed by guest writes; conversely, in the active state it tracks guest writes. This
>>>> implementation uses an approach with only one active bitmap at a time. This should reduce
>>>> guest write penalties in the presence of checkpoints. So on the first snapshot we create
>>>> bitmap B1; it tracks changes from snapshot 1. On the second snapshot we create bitmap B2
>>>> and disable bitmap B1, and so on. Now bitmap B1 keeps the changes from snapshot 1 to
>>>> snapshot 2, B2 the changes from snapshot 2 to snapshot 3, and so on. The last bitmap is
>>>> active and accumulates all disk changes after the latest snapshot.
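>>>>
>>>> In other words, the bitmap chain looks roughly like this (only the last bitmap is active):
>>>>
>>>>     snapshot 1               snapshot 2               snapshot 3
>>>>         |---- B1 (disabled) ----|---- B2 (disabled) ----|---- B3 (active) ----> now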
>>
>> So you are trying to optimize away write penalties if you have, say, ten
>> bitmaps representing checkpoints so we don't have to record all new
>> writes to all ten.
>>
>> This makes sense, and I would have liked to formalize the concept in
>> QEMU, but response to that idea was very poor at the time.
>>
>> Also my design was bad :)
>>
>>>>
>>>> Getting the changed-blocks bitmap from some checkpoint in the past up to the current
>>>> snapshot is quite simple in this scheme. For example, if the last snapshot is 7, then
>>>> to get the changes from snapshot 3 to the latest snapshot we need to merge bitmaps B3,
>>>> B4, B5 and B6. Merge is just a logical OR of the bitmap bits.
>>>>
>>>> Deleting a checkpoint somewhere in the middle of the checkpoint sequence requires
>>>> merging the corresponding bitmap into the previous bitmap in this scheme.
>>>>
>>
>> Previous, or next?
> 
> In short previous.
> 
>>
>> Say we've got bitmaps (in chronological order from oldest to newest)
>>
>> A B C D E F G H
>>
>> and we want to delete bitmap (or "checkpoint") 'C':
>>
>> A B   D E F G H
>>
>> the bitmap representing checkpoint 'D' should now contain the bits that
>> used to be in 'C', right? That way all the checkpoints still represent
>> their appropriate points in time.
> 
> I merge into the previous one due to the definition above. "A" contains changes from
> point in time A to point in time B, and so on. So if you delete C, then in order
> for B to keep the changes from point in time B to point in time D
> (the next in the checkpoint chain) you need to merge C into B.
> 

I'm not sure the way it's explained here makes sense to me, but
Vladimir's explanation does.

>>
>>
>> The only problem comes when you delete a checkpoint on the end and the
>> bits have nowhere to go:
>>
>> A B C
>>
>> A B _
>>
>> In this case you really do lose a checkpoint -- but depending on how we
>> annotate this, it may or may not be possible to delete the most recent
>> checkpoint. Let's assume that the currently active bitmap that doesn't
>> represent *any* point in time yet (because it's still active and
>> recording new writes) is noted as 'X':
>>
>> A B C X
>>
>> If we delete C now, then, that bitmap can get re-merged into the *active
>> bitmap* X:
>>
>> A B _ X
> 
> You can delete any bitmap (and accordingly any checkpoint). If the checkpoint
> is the last one we just merge the last bitmap into the previous one and additionally
> make the previous bitmap active.
> 
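
If I follow, deleting the most recent checkpoint C (with predecessor B and first
checkpoint A) would then look roughly like this -- just a sketch reusing the x-vz-*
commands and the NAME^PARENT naming scheme from below, and assuming the enable
command can sit in the same transaction as the merge:

    {
        "execute": "transaction",
        "arguments": {
            "actions": [
                {
                    "type": "x-vz-block-dirty-bitmap-merge",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "src_name": "libvirt-C^B",
                        "dst_name": "libvirt-B^A"
                    }
                },
                {
                    "type": "x-vz-block-dirty-bitmap-enable",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "name": "libvirt-B^A"
                    }
                }
            ]
        }
    }
    {
        "execute": "x-vz-block-dirty-bitmap-remove",
        "arguments": {
            "node": "drive-scsi0-0-0-0",
            "name": "libvirt-C^B"
        }
    }
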
>>
>>>> We use persistent bitmaps in the implementation. This means that upon qemu process
>>>> termination bitmaps are saved in the disk images' metadata and restored back on
>>>> qemu process start. This makes a checkpoint a persistent property, that is, we
>>>> keep them across domain starts/stops. Qemu does not try hard to keep bitmaps:
>>>> if something goes wrong upon save, the bitmap is dropped. The same applies to the
>>>> migration process too. For the backup process this is not critical; if we don't
>>>> discover a checkpoint we can always make a full backup. Also, qemu provides no
>>>> special means to track the order of bitmaps. These facts are critical for an
>>>> implementation with one active bitmap at a time. We need the right order of bitmaps upon
>>>> merge - for snapshot N and block changes from snapshot K, K < N, to N we need
>>>> to merge bitmaps B_{K}, ..., B_{N-1}. Also, if one of the bitmaps to be merged
>>>> is missing we can't calculate the desired block changes either.
>>>>
>>
>> Right. A missing bitmap anywhere in the sequence invalidates the entire
>> sequence.
>>
>>>> So the implementation encodes bitmap order in their names. For snapshot A1 the bitmap
>>>> name will be A1, for snapshot A2 the bitmap name will be A2^A1, and so on. Using this
>>>> naming encoding we can find out the bitmap order upon domain start and check for missing
>>>> ones. This complicates bitmap removal a bit though. For example, removing
>>>> a bitmap somewhere in the middle looks like this:
>>>>
>>>> - removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1})
>>>>     - create new bitmap named NAME_{K+1}^NAME_{K-1}      ---. 
>>>>     - disable new bitmap                                    | This is effectively renaming
>>>>     - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap    | of bitmap K+1 to comply with the naming scheme
>>>>     - remove bitmap NAME_{K+1}^NAME_{K}                  ___/
>>>>     - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
>>>>     - remove bitmap NAME_{K}^NAME_{K-1}
>>>>
>>>> As you can see, we need to change the name of bitmap K+1 to keep our bitmap
>>>> naming scheme. This is done by creating a new K+1 bitmap with the appropriate name
>>>> and copying the old K+1 bitmap into the new one.
>>>>
>>
>> That seems... unfortunate. A record could be kept in libvirt instead,
>> couldn't it?
>>
>> A : Bitmap A, Time 12:34:56, Child of (None), Parent of B
>> B : Bitmap B, Time 23:15:46, Child of A, Parent of (None)
> 
> Yes, it is possible. I was reluctant to implement it this way for a couple of reasons:
> 
> - if bitmap metadata is in libvirt we need to carefully design it for
>   things like libvirtd crashes. If the metadata is out of sync with qemu then we can get
>   broken incremental backups. One possible design is:
> 
>     - on bitmap deletion, save metadata after deleting the bitmap in qemu; in case
>       of a libvirtd crash in between, upon libvirtd restart we can drop bitmaps
>       that are in the metadata but not in qemu as already deleted
> 
>     - on bitmap add (creating a new snapshot with a checkpoint), save metadata with the bitmap
>       before creating the bitmap in qemu; then again we have a way to handle libvirtd crashes
>       in between
>   
>   So this approach has tricky parts too. The suggested approach uses qemu
>   transactions to keep bitmaps consistent.
> 
> - I don't like another piece of metadata which looks like it belongs to the disks and not
>   the domain. It is like keeping disk size in the domain xml.
> 

Yeah, I see... Having to rename bitmaps in the middle of the chain seems
unfortunate, though...

and I'm still a little wary of using the names as important metadata to
be really honest. It feels like a misuse of the field.
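
If libvirt did keep the relationship itself, though, the record could presumably stay
quite small -- something like the per-disk sketch below, which is purely hypothetical
(element and attribute names are made up for illustration, not a proposed format):

    <checkpoints>
      <checkpoint name='A' bitmap='libvirt-A' parent=''/>
      <checkpoint name='B' bitmap='libvirt-B' parent='A'/>
      <checkpoint name='C' bitmap='libvirt-C' parent='B'/>
    </checkpoints>

That would avoid encoding the chain in the bitmap names, at the cost of the
crash-consistency bookkeeping you describe above.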

>>
>> I suppose in this case you can't *reconstruct* this information from the
>> bitmap stored in the qcow2, which necessitates your naming scheme...
>>
>> ...Still, if you forego this requirement, deleting bitmaps in the middle
>> becomes fairly easy.
>>
>>>> So while it is possible to have only one active bitmap at a time, it costs
>>>> some exercise at the management layer. To me it looks like qemu itself is a better
>>>> place to track the bitmap chain's order and consistency.
>>>
>>
>> If this is a hard requirement, it's certainly *easier* to track the
>> relationship in QEMU ...
>>
>>> Libvirt is already tracking a tree relationship between internal
>>> snapshots (the virDomainSnapshotCreateXML), because qemu does NOT track
>>> that (true, internal snapshots don't get as much attention as external
>>> snapshots) - but the fact remains that qemu is probably not the best
>>> place to track relationship between multiple persistent bitmaps, any
>>> more than it tracks relationships between internal snapshots.  So having
>>> libvirt track relations between persistent bitmaps is just fine.  Do we
>>> really have to rename bitmaps in the qcow2 file, or can libvirt track it
>>> all on its own?
>>>
>>
>> This is a way, really, of storing extra metadata by using the bitmap
>> name as arbitrary data storage.
>>
>> I'd say either we promote QEMU to understanding checkpoints, or enhance
>> libvirt to track what it needs independent of QEMU -- but having to
>> rename bitmaps smells fishy to me.
>>
>>> Earlier, you said that the new virDomainBlockSnapshotPtr are
>>> independent, with no relations between them.  But here, you are wanting
>>> to keep incremental backups related to one another.
>>>
>>
>> I think the *snapshots*, as temporary objects, are independent and don't
>> carry a relation to each other.
>>
>> The *checkpoints* here, however, are persistent and interrelated.
>>
>>>>
>>>> Now how exporting bitmaps looks like.
>>>>
>>>> - add disk snapshot N to export, with changes from checkpoint K
>>>>     - add fleece blockdev to NBD exports
>>>>     - create new bitmap T
>>>>     - disable bitmap T
>>>>     - merge bitmaps K, K+1, .. N-1 into T
>>
>> I see; so we compute a new slice based on previous bitmaps and back up
>> from that arbitrary slice.
>>
>> So "T" is a temporary bitmap meant to be discarded at the conclusion of
>> the operation, making it much more like a consumable object.
>>
>>>>     - add bitmap T to the NBD export
>>>>
>>>> - remove disk snapshot from export
>>>>     - remove fleece blockdev from NBD exports
>>>>     - remove bitmap T
>>>>
>>
>> Aha.
>>
>>>> Here are qemu command examples for operations with checkpoints. I'll make
>>>> several snapshots with checkpoints for the purpose of better illustration.
>>>>
>>>> - create snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8 with checkpoint
>>>>     - same as without a checkpoint, but additionally add a bitmap on fleece blockjob start
>>>>
>>>>     ...
>>>>     {
>>>>         "execute": "transaction",
>>>>         "arguments": {
>>>>             "actions": [
>>>>                 {
>>>>                     "type": "blockdev-backup",
>>>>                     "data": {
>>>>                         "device": "drive-scsi0-0-0-0",
>>>>                         "sync": "none",
>>>>                         "target": "snapshot-scsi0-0-0-0"
>>>>                     }
>>>>                 },
>>>>                 {
>>>>                     "type": "block-dirty-bitmap-add",
>>>>                     "data": {
>>>>                         "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>                         "node": "drive-scsi0-0-0-0",
>>>>                         "persistent": true
>>>>                     }
>>>>                 }
>>>
>>
>> So a checkpoint creates a reference point, but NOT a backup. You are
>> manually creating checkpoint instances.
>>
>> In this case, though, you haven't disabled the previous checkpoint's
>> bitmap (if any?) atomically with the creation of this one...
> 
> In the example this is first snapshot so there is no previous checkpoint
> and thus nothing to disable.
> 

OK, got it!

>>
>>
>>> Here, the transaction makes sense; you have to create the persistent
>>> dirty bitmap to track from the same point in time.  The dirty bitmap is
>>> tied to the active image, not the backup, so that when you create the
>>> NEXT incremental backup, you have an accurate record of which sectors
>>> were touched in snapshot-scsi0-0-0-0 between this transaction and the next.
>>>
>>>>             ]
>>>>         }
>>>>     }
>>>>
>>>> - delete snapshot d068765e-8b50-4d74-9b72-1e55c663cbf8
>>>>     - same as without checkpoints
>>>>
>>>> - create snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17 with checkpoint
>>>>     - same actions as for the first snapshot, but additionally disable the first bitmap
>>>
>>> Again, you're showing the QMP commands that libvirt is issuing; which
>>> libvirt API calls are driving these actions?
>>>
>>>>
>>>>     ...
>>>>     {
>>>>         "execute": "transaction",
>>>>         "arguments": {
>>>>             "actions": [
>>>>                 {
>>>>                     "type": "blockdev-backup",
>>>>                     "data": {
>>>>                         "device": "drive-scsi0-0-0-0",
>>>>                         "sync": "none",
>>>>                         "target": "snapshot-scsi0-0-0-0"
>>>>                     }
>>>>                 },
>>>>                 {
>>>>                     "type": "x-vz-block-dirty-bitmap-disable",
>>>>                     "data": {
>>>
>>> Do you have measurements on whether having multiple active bitmaps hurts
>>> performance?  I'm not yet sure that managing a chain of disabled bitmaps
>>> (and merging them as needed for restores) is more or less efficient than
>>> managing multiple bitmaps all the time.  On the other hand, you do have
>>> a point that restore is a less frequent operation than backup, so making
>>> backup as lean as possible and putting more work on restore is a
>>> reasonable tradeoff, even if it adds complexity to the management for
>>> doing restores.
>>>
>>
>> Depending on the number of checkpoints intended to be kept... we
>> certainly make no real promises on the efficiency of marking so many.
>> It's at *least* a linear increase with each checkpoint...
>>
>>>>                         "name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>                         "node": "drive-scsi0-0-0-0"
>>>>                     }
>>>>                 },
>>>>                 {
>>>>                     "type": "block-dirty-bitmap-add",
>>>>                     "data": {
>>>>                         "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>                         "node": "drive-scsi0-0-0-0",
>>>>                         "persistent": true
>>>>                     }
>>>>                 }
>>>>             ]
>>>>         }
>>>>     }
>>>>
>>
>> Oh, I see, you handle the "disable old" case here.
>>
>>>> - delete snapshot 0044757e-1a2d-4c2c-b92f-bb403309bb17
>>>> - create snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b with checkpoint
>>>>
>>>> - add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to export, with a bitmap of
>>>>   changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
>>>>     - same as adding an export without a checkpoint, but additionally
>>>>         - form the result bitmap
>>>>         - add the bitmap to the NBD export
>>>>
>>>>     ...
>>>>     {
>>>>         "execute": "transaction",
>>>>         "arguments": {
>>>>             "actions": [
>>>>                 {
>>>>                     "type": "block-dirty-bitmap-add",
>>>>                     "data": {
>>>>                         "node": "drive-scsi0-0-0-0",
>>>>                         "name": "libvirt-__export_temporary__",
>>>>                         "persistent": false
>>>>                     }
>>>>                 },
>>>>                 {
>>>>                     "type": "x-vz-block-dirty-bitmap-disable",
>>>>                     "data": {
>>>>                         "node": "drive-scsi0-0-0-0",
>>>>                         "name": "libvirt-__export_temporary__"
>>>>                     }
>>>>                 },
>>>>                 {
>>>>                     "type": "x-vz-block-dirty-bitmap-merge",
>>>>                     "data": {
>>>>                         "node": "drive-scsi0-0-0-0",
>>>>                         "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>                         "dst_name": "libvirt-__export_temporary__"
>>>>                     }
>>>>                 },
>>>>                 {
>>>>                     "type": "x-vz-block-dirty-bitmap-merge",
>>>>                     "data": {
>>>>                         "node": "drive-scsi0-0-0-0",
>>>>                         "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>                         "dst_name": "libvirt-__export_temporary__"
>>>>                     }
>>>>                 }
>>>>             ]
>>>>         }
>>>>     }
>>
>> OK, so in this transaction you add a new temporary bitmap for export,
>> and merge the contents of two bitmaps into it.
>>
>> However, it doesn't look like you created a new checkpoint and managed
>> that handoff here, did you?
> 
> We don't need to create checkpoints for the purpose of exporting. Only a temporary
> bitmap to merge the appropriate bitmap chain.
> 

See reply below

>>
>>>>     {
>>>>         "execute": "x-vz-nbd-server-add-bitmap",
>>>>         "arguments": {
>>>>             "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b",
>>>>             "bitmap": "libvirt-__export_temporary__",
>>>>             "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8"
>>>>         }
>>
>> And then here, once the bitmap and the data is already frozen, it's
>> actually alright if we add the export at a later point in time.
>>
>>>
>>> Adding a bitmap to a server would advertise to the NBD client
>>> that it can query the
>>> "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace
>>> during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
>>>
>>
>> Don't know much about this, I stopped paying attention to the BLOCK
>> STATUS patches. Is the NBD spec the best way to find out the current
>> state right now?
>>
>> (Is there a less technical, briefer overview somewhere, perhaps from a
>> commit message or a cover letter?)
>>
>>>>     }
>>>>
>>>> - remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
>>>>     - same as without a checkpoint, but additionally remove the temporary bitmap
>>>>     
>>>>     ...
>>>>     {
>>>>         "arguments": {
>>>>             "name": "libvirt-__export_temporary__",
>>>>             "node": "drive-scsi0-0-0-0"
>>>>         },
>>>>         "execute": "block-dirty-bitmap-remove"
>>>>     }
>>>>
>>
>> OK, this just deletes the checkpoint. I guess we delete the node and
> I would not call it a checkpoint. A checkpoint is something visible to the client:
> an ability to get CBT from that point in time.
> 
> Here we create a temporary bitmap to calculate the desired CBT.
> 

Aha, right. I misspoke; but it's because in my mind I feel like creating
an export will *generally* be accompanied by a new checkpoint, so I was
surprised to see that missing from the example.

But, yes, there's no reason you *have* to create a new checkpoint when
you do an export -- but I suspect that when you DO create a new
checkpoint it's generally going to be accompanied by an export like
this, right?

>> stop the NBD server too, right?
> 
> yeah, just like in the case without a checkpoint (mentioned in that case's description)
> 
>>
>>>> - delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17
>>>>     (a similar operation is described in the section about the naming scheme for bitmaps,
>>>>      with the difference that K+1 is N here and thus the new bitmap should not be disabled)
>>>
>>> A suggestion on the examples - while UUIDs are nice and handy for
>>> management tools, they are a pain to type and for humans to quickly
>>> read.  Is there any way we can document a sample transaction stream with
>>> all the actors involved (someone issues a libvirt API call XYZ, libvirt
>>> in turn issues QMP command ABC), and using shorter names that are easier
>>> to read as humans?
>>>
>>
>> Yeah, A-B-C-D terminology would be nice for the examples. It's fine if
>> the actual implementation uses UUIDs.
>>
>>>>
>>>>     {
>>>>         "arguments": {
>>>>             "actions": [
>>>>                 {
>>>>                     "type": "block-dirty-bitmap-add",
>>>>                     "data": {
>>>>                         "node": "drive-scsi0-0-0-0",
>>>>                         "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>                         "persistent": true
>>>>                     }
>>>>                 },
>>>>                 {
>>>>                     "type": "x-vz-block-dirty-bitmap-merge",
>>>>                     "data": {
>>>>                         "node": "drive-scsi0-0-0-0",
>>>>                         "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>                         "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8"
>>>>                     }
>>>>                 },
>>>>                 {
>>>>                     "type": "x-vz-block-dirty-bitmap-merge",
>>>>                     "data": {
>>>>                         "node": "drive-scsi0-0-0-0",
>>>>                         "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17",
>>>>                         "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8"
>>>>                     }
>>>>                 }
>>>>             ]
>>>>         },
>>>>         "execute": "transaction"
>>>>     }
>>>>     {
>>>>         "execute": "x-vz-block-dirty-bitmap-remove",
>>>>         "arguments": {
>>>>             "node": "drive-scsi0-0-0-0",
>>>>             "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17"
>>>>         }
>>>>     }
>>>>     {
>>>>         "execute": "x-vz-block-dirty-bitmap-remove",
>>>>         "arguments": {
>>>>             "node": "drive-scsi0-0-0-0",
>>>>             "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
>>>>         }
>>>>     }
>>>>
>>>> Here is a list of bitmap commands used in the implementation but not yet upstream (AFAIK).
>>>>
>>>> x-vz-block-dirty-bitmap-remove
>>
>> We already have this, right? It doesn't even need to be transactionable.
>>
>>>> x-vz-block-dirty-bitmap-merge
>>
>> You need this...
>>
>>>> x-vz-block-dirty-bitmap-disable
>>
>> And this we had originally but since removed, but can be re-added trivially.
>>
>>>> x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint)
>>>> x-vz-nbd-server-add-bitmap
>>>>
>>
>> Do my comments make sense? Am I understanding you right so far? I'll try
>> to offer a competing writeup to make sure we're on the same page with
>> your proposed design before I waste any time trying to critique it -- in
>> case I'm misunderstanding you.
> 
> Yes, looks like we are in tune.
> 

More or less. Thank you for taking the time to explain it all out to me.
I think I understand the general shape of your proposal, more or less.

>>
>> Thank you for leading the charge and proposing new APIs for this
>> feature. It will be very nice to expose the incremental backup
>> functionality we've been working on in QEMU to users of libvirt.
>>
>> --js
> 
> There are also patches ( if the API design survives the review phase at least partially :) )
> 

I can only really help (or hinder?) where QEMU primitives are concerned
-- the actual libvirt API is going to be what Eric cares about.

I think this looks good so far, though -- at least, it makes sense to me.

>>
>>>> *Restore operation nuances*
>>>>
>>>> As was written above, to restore a domain one needs to start it in a paused
>>>> state, export the domain's disks and write to them from the backup. However, qemu currently
>>>> does not let us export disks for write even for a domain that never starts the guest CPUs.
>>>> We have an experimental qemu command line option -x-vz-nbd-restore (passed together
>>>> with the -incoming option) to fix this.
>>>
>>> Why can't restore be done while the guest is offline?  (Oh right, we
>>> still haven't added decent qemu-img support for bitmap manipulation, so
>>> we need a qemu process around for any bitmap changes).
>>>
>>
>> I'm working on this right now, actually!
>>
>> I'm working on JSON format output for bitmap querying, and simple
>> clear/delete commands. I hope to send this out very soon.
>>
>>> As I understand it, the point of bitmaps and snapshots is to create an
>>> NBD server that a third-party can use to read just the dirty portions of
>>> a disk in relation to a known checkpoint, to save off data in whatever
>>> form it wants; so you are right that the third party then needs a way to
>>> rewrite data from whatever internal form it stored it in back to the
>>> view that qemu can consume when rolling back to a given backup, prior to
>>> starting the guest on the restored data.  Do you need additional libvirt
>>> APIs exposed for this, or do the proposed APIs for adding snapshots
>>> cover everything already with just an additional flag parameter that
>>> says whether the <domainblocksnapshot> is readonly (the third-party is
>>> using it for collecting the incremental backup data) or writable (the
>>> third-party is actively writing its backup into the file, and when it is
>>> done, then perform a block-commit to merge that data back onto the main
>>> qcow2 file)?
>>>


Thank you!



