[libvirt] [RFC v2] external (pull) backup API

Nikolay Shirokovskiy nshirokovskiy at virtuozzo.com
Fri Apr 13 10:04:21 UTC 2018



On 13.04.2018 02:53, John Snow wrote:

>>
>>>>> First a few facts about qemu dirty bitmaps.
>>>>>
>>>>> A bitmap can be either in the active or the disabled state. In the disabled state it does not
>>>>> get changed on guest writes; conversely, in the active state it tracks guest
>>>>> writes. This implementation uses an approach with only one active bitmap at
>>>>> a time. This should reduce guest write penalties in the presence of
>>>>> checkpoints. So on the first snapshot we create bitmap B1. It tracks changes
>>>>> from snapshot 1. On the second snapshot we create bitmap B2 and disable bitmap
>>>>> B1, and so on. Now bitmap B1 keeps changes from snapshot 1 to snapshot 2, B2
>>>>> - changes from snapshot 2 to snapshot 3, and so on. The last bitmap is active and
>>>>> accumulates the disk changes made after the latest snapshot.
>>>
>>> So you are trying to optimize away write penalties if you have, say, ten
>>> bitmaps representing checkpoints so we don't have to record all new
>>> writes to all ten.
>>>
>>> This makes sense, and I would have liked to formalize the concept in
>>> QEMU, but response to that idea was very poor at the time.
>>>
>>> Also my design was bad :)
>>>
>>>>>
>>>>> Getting the changed-blocks bitmap from some checkpoint in the past up to the current snapshot
>>>>> is quite simple in this scheme. For example, if the last snapshot is 7, then
>>>>> to get changes from snapshot 3 to the latest snapshot we need to merge bitmaps B3,
>>>>> B4, B5 and B6. Merge is just a logical OR of the bitmap bits.
>>>>>
>>>>> Deleting a checkpoint somewhere in the middle of the checkpoint sequence requires
>>>>> merging the corresponding bitmap into the previous bitmap in this scheme.
>>>>>
>>>
>>> Previous, or next?
>>
>> In short previous.
>>
>>>
>>> Say we've got bitmaps (in chronological order from oldest to newest)
>>>
>>> A B C D E F G H
>>>
>>> and we want to delete bitmap (or "checkpoint") 'C':
>>>
>>> A B   D E F G H
>>>
>>> the bitmap representing checkpoint 'D' should now contain the bits that
>>> used to be in 'C', right? That way all the checkpoints still represent
>>> their appropriate points in time.
>>
>> I merge into the previous one due to the definition above. "A" contains changes from
>> point in time A to point in time B, and so on. So if you delete C, then in
>> order for B to keep changes from point in time B to point in time D
>> (the next in the checkpoint chain) you need to merge C into B.
>>
> 
> I'm not sure the way it's explained here makes sense to me, but
> Vladimir's explanation does.
> 
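To make the direction concrete: deleting the middle checkpoint C from the
A B C D example above boils down to something like this at the QMP level
(a simplified sketch reusing the experimental x-vz-* commands from this RFC;
bitmap names are shortened here, and the renaming of D required by the naming
scheme described further down is omitted):

    {
        "execute": "transaction",
        "arguments": {
            "actions": [
                {
                    "type": "x-vz-block-dirty-bitmap-merge",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "src_name": "libvirt-C",
                        "dst_name": "libvirt-B"
                    }
                }
            ]
        }
    }
    {
        "execute": "x-vz-block-dirty-bitmap-remove",
        "arguments": {
            "node": "drive-scsi0-0-0-0",
            "name": "libvirt-C"
        }
    }

After that B covers the changes from point B up to point D, so the remaining
checkpoints still describe the same points in time.
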
>>>
>>>
>>> The only problem comes when you delete a checkpoint on the end and the
>>> bits have nowhere to go:
>>>
>>> A B C
>>>
>>> A B _
>>>
>>> In this case you really do lose a checkpoint -- but depending on how we
>>> annotate this, it may or may not be possible to delete the most recent
>>> checkpoint. Let's assume that the currently active bitmap that doesn't
>>> represent *any* point in time yet (because it's still active and
>>> recording new writes) is noted as 'X':
>>>
>>> A B C X
>>>
>>> If we delete C now, then, that bitmap can get re-merged into the *active
>>> bitmap* X:
>>>
>>> A B _ X
>>
>> You can delete any bitmap (and accordingly any checkpoint). If the checkpoint
>> is the last one we just merge the last bitmap into the previous one and additionally
>> make the previous bitmap active.
>>
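A rough sketch of that last-checkpoint case, with shortened bitmap names, and
assuming the x-vz enable action can be grouped into a transaction the same way
disable/merge are grouped in the examples further down:

    {
        "execute": "transaction",
        "arguments": {
            "actions": [
                {
                    "type": "x-vz-block-dirty-bitmap-enable",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "name": "libvirt-B"
                    }
                },
                {
                    "type": "x-vz-block-dirty-bitmap-merge",
                    "data": {
                        "node": "drive-scsi0-0-0-0",
                        "src_name": "libvirt-C",
                        "dst_name": "libvirt-B"
                    }
                }
            ]
        }
    }
    {
        "execute": "x-vz-block-dirty-bitmap-remove",
        "arguments": {
            "node": "drive-scsi0-0-0-0",
            "name": "libvirt-C"
        }
    }

After this B is the active bitmap again and covers everything C used to cover.
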
>>>
>>>>> We use persistent bitmaps in the implementation. This means that upon qemu process
>>>>> termination bitmaps are saved in the disk images' metadata and restored back on
>>>>> qemu process start. This makes a checkpoint a persistent property, that is, we
>>>>> keep them across domain starts/stops. Qemu does not try hard to keep bitmaps:
>>>>> if something goes wrong on save, the bitmap is dropped. The same applies to the
>>>>> migration process too. For the backup process this is not critical. If we don't
>>>>> discover a checkpoint we can always make a full backup. Also, qemu provides no
>>>>> special means to track the order of bitmaps. These facts are critical for an
>>>>> implementation with one active bitmap at a time. We need the right order of bitmaps upon
>>>>> merge - for snapshot N and block changes from snapshot K (K < N) to N we need
>>>>> to merge bitmaps B_{K}, ..., B_{N-1}. Also, if one of the bitmaps to be merged
>>>>> is missing we can't calculate the desired block changes either.
>>>>>
>>>
>>> Right. A missing bitmap anywhere in the sequence invalidates the entire
>>> sequence.
>>>
>>>>> So the implementation encodes the bitmap order in their names. For snapshot A1, the bitmap
>>>>> name will be A1; for snapshot A2 the bitmap name will be A2^A1, and so on. Using this naming
>>>>> encoding, upon domain start we can find out the bitmap order and check for missing
>>>>> ones. This complicates bitmap removal a bit though. For example, removing
>>>>> a bitmap somewhere in the middle looks like this:
>>>>>
>>>>> - removing bitmap K (bitmap name is NAME_{K}^NAME_{K-1})
>>>>>     - create new bitmap named NAME_{K+1}^NAME_{K-1}      ---. 
>>>>>     - disable new bitmap                                    | This is effectively renaming
>>>>>     - merge bitmap NAME_{K+1}^NAME_{K} to the new bitmap    | of bitmap K+1 to comply with the naming scheme
>>>>>     - remove bitmap NAME_{K+1}^NAME_{K}                  ___/
>>>>>     - merge bitmap NAME_{K}^NAME_{K-1} to NAME_{K-1}^NAME_{K-2}
>>>>>     - remove bitmap NAME_{K}^NAME_{K-1}
>>>>>
>>>>> As you can see, we need to change the name of bitmap K+1 to keep our bitmap
>>>>> naming scheme. This is done by creating a new K+1 bitmap with the appropriate name
>>>>> and copying the old K+1 bitmap into the new one.
>>>>>
>>>
>>> That seems... unfortunate. A record could be kept in libvirt instead,
>>> couldn't it?
>>>
>>> A : Bitmap A, Time 12:34:56, Child of (None), Parent of B
>>> B : Bitmap B, Time 23:15:46, Child of A, Parent of (None)
>>
>> Yes, it is possible. I was reluctant to implement it this way for a couple of reasons:
>>
>> - if bitmap metadata is in libvirt we need to carefully design it for
>>   things like libvirtd crashes. If the metadata is out of sync with qemu then we can get
>>   broken incremental backups. One possible design is:
>>
>>     - on bitmap deletion, save metadata after deleting the bitmap in qemu; in case
>>       of a libvirtd crash in between, upon libvirtd restart we can drop bitmaps
>>       that are in the metadata but not in qemu as already deleted
>>
>>     - on bitmap addition (creating a new snapshot with a checkpoint), save metadata with the bitmap
>>       before creating the bitmap in qemu; then again we have a way to handle libvirtd crashes
>>       in between
>>
>>   So this approach has tricky parts too. The suggested approach uses qemu
>>   transactions to keep bitmaps consistent.
>>
>> - I don't like another piece of metadata which looks like it belongs to the disks and not
>>   the domain. It is like keeping the disk size in the domain xml.
>>
> 
> Yeah, I see... Having to rename bitmaps in the middle of the chain seems
> unfortunate, though...
> 
> and I'm still a little wary of using the names as important metadata to
> be really honest. It feels like a misuse of the field.

Me too. I'd like to see a different offering from qemu for bitmaps so that
mgmt does not have to do such heavy lifting.
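
If the checkpoint relations were kept on the libvirt side instead, reconciling
them with qemu on domain start could presumably build on what qemu already
reports for persistent bitmaps, e.g. (output heavily trimmed, bitmap names
purely illustrative):

    { "execute": "query-block" }
    {
        "return": [
            {
                "device": "drive-scsi0-0-0-0",
                "dirty-bitmaps": [
                    { "name": "libvirt-A", "status": "disabled", ... },
                    { "name": "libvirt-B", "status": "active", ... }
                ],
                ...
            }
        ]
    }

Any bitmap present in the libvirt metadata but missing here would then be
treated as a lost checkpoint (forcing a full backup), similar to the handling
described above.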

[snip]

>>>>>
>>>>> - add disk snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b to the export, together with a bitmap of
>>>>>   changes from checkpoint d068765e-8b50-4d74-9b72-1e55c663cbf8
>>>>>     - same as adding an export without a checkpoint, but additionally
>>>>>         - form the result bitmap
>>>>>         - add the bitmap to the NBD export
>>>>>
>>>>>     ...
>>>>>     {
>>>>>         "execute": "transaction",
>>>>>         "arguments": {
>>>>>             "actions": [
>>>>>                 {
>>>>>                     "type": "block-dirty-bitmap-add",
>>>>>                     "data": {
>>>>>                         "node": "drive-scsi0-0-0-0",
>>>>>                         "name": "libvirt-__export_temporary__",
>>>>>                         "persistent": false
>>>>>                     }
>>>>>                 },
>>>>>                 {
>>>>>                     "type": "x-vz-block-dirty-bitmap-disable",
>>>>>                     "data": {
>>>>>                         "node": "drive-scsi0-0-0-0",
>>>>>                         "name": "libvirt-__export_temporary__"
>>>>>                     }
>>>>>                 },
>>>>>                 {
>>>>>                     "type": "x-vz-block-dirty-bitmap-merge",
>>>>>                     "data": {
>>>>>                         "node": "drive-scsi0-0-0-0",
>>>>>                         "src_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>>                         "dst_name": "libvirt-__export_temporary__"
>>>>>                     }
>>>>>                 },
>>>>>                 {
>>>>>                     "type": "x-vz-block-dirty-bitmap-merge",
>>>>>                     "data": {
>>>>>                         "node": "drive-scsi0-0-0-0",
>>>>>                         "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>>                         "dst_name": "libvirt-__export_temporary__"
>>>>>                     }
>>>>>                 }
>>>>>             ]
>>>>>         }
>>>>>     }
>>>
>>> OK, so in this transaction you add a new temporary bitmap for export,
>>> and merge the contents of two bitmaps into it.
>>>
>>> However, it doesn't look like you created a new checkpoint and managed
>>> that handoff here, did you?
>>
>> We don't need to create checkpoints for the purpose of exporting. Only a temporary
>> bitmap to merge the appropriate bitmap chain.
>>
> 
> See reply below
> 
>>>
>>>>>     {
>>>>>         "execute": "x-vz-nbd-server-add-bitmap"
>>>>>         "arguments": {
>>>>>             "name": "sda-8fc02db3-166f-4de7-b7aa-1f7303e6162b"
>>>>>             "bitmap": "libvirt-__export_temporary__",
>>>>>             "bitmap-export-name": "d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>>         },
>>>
>>> And then here, once the bitmap and the data is already frozen, it's
>>> actually alright if we add the export at a later point in time.
>>>
>>>>
>>>> Adding a bitmap to a server would advertise to the NBD client
>>>> that it can query the
>>>> "qemu-dirty-bitmap:d068765e-8b50-4d74-9b72-1e55c663cbf8" namespace
>>>> during NBD_CMD_BLOCK_STATUS, rather than just "base:allocation"?
>>>>
>>>
>>> Don't know much about this, I stopped paying attention to the BLOCK
>>> STATUS patches. Is the NBD spec the best way to find out the current
>>> state right now?
>>>
>>> (Is there a less technical, briefer overview somewhere, perhaps from a
>>> commit message or a cover letter?)
>>>
>>>>>     }
>>>>>
>>>>> - remove snapshot 8fc02db3-166f-4de7-b7aa-1f7303e6162b from export
>>>>>     - same as without a checkpoint, but additionally remove the temporary bitmap
>>>>>     
>>>>>     ...
>>>>>     {
>>>>>         "arguments": {
>>>>>             "name": "libvirt-__export_temporary__",
>>>>>             "node": "drive-scsi0-0-0-0"
>>>>>         },
>>>>>         "execute": "block-dirty-bitmap-remove"
>>>>>     }
>>>>>
>>>
>>> OK, this just deletes the checkpoint. I guess we delete the node and
>> I would not call it a checkpoint. A checkpoint is something visible to the client:
>> an ability to get CBT from that point in time.
>>
>> Here we create a temporary bitmap to calculate the desired CBT.
>>
> 
> Aha, right. I misspoke; but it's because in my mind I feel like creating
> an export will *generally* be accompanied by a new checkpoint, so I was
> surprised to see that missing from the example.
> 
> But, yes, there's no reason you *have* to create a new checkpoint when
> you do an export -- but I suspect that when you DO create a new
> checkpoint it's generally going to be accompanied by an export like
> this, right?

Generally speaking yes, but not always. Consider the first backup. It has to
be full, there is no other option. We create a snapshot with a checkpoint but
export only the snapshot. There is no need to export a bitmap (a bitmap from
what checkpoint? the current checkpoint does not count). We create the checkpoint
because we want the second backup to be incremental from the first.

Later, from time to time, we will create full backups again so that the incremental
chain does not grow too long, I guess. Then again we create a snapshot with a
checkpoint but do not export any CBT.

In short, we create a checkpoint not for the purpose of export but rather to
have a point in time to get CBT from in later backups.
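
For illustration, that first full backup then only needs the plain NBD export
of the snapshotted disk, with no bitmap attached; a minimal sketch (the node
name is reused from the examples in this thread and the export naming details
are left out):

    {
        "execute": "nbd-server-add",
        "arguments": {
            "device": "drive-scsi0-0-0-0",
            "writable": false
        }
    }

Only the later, incremental backups additionally need the temporary bitmap and
the x-vz-nbd-server-add-bitmap call quoted earlier.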

Actually, in the current implementation (not yet published) there is a restriction
that a snapshot can be exported with CBT only if the snapshot was created with a
checkpoint. This saves us from creating a temporary qemu bitmap which marks the
end point in time for the requested CBT.

> 
>>> stop the NBD server too, right?
>>
>> yeah, just like in the case without a checkpoint (mentioned in that case's description)
>>
>>>
>>>>> - delete checkpoint 0044757e-1a2d-4c2c-b92f-bb403309bb17
>>>>>     (a similar operation is described in the section about the naming scheme for bitmaps,
>>>>>      with the difference that K+1 is N here and thus the new bitmap should not be disabled)
>>>>
>>>> A suggestion on the examples - while UUIDs are nice and handy for
>>>> management tools, they are a pain to type and for humans to quickly
>>>> read.  Is there any way we can document a sample transaction stream with
>>>> all the actors involved (someone issues a libvirt API call XYZ, libvirt
>>>> in turn issues QMP command ABC), and using shorter names that are easier
>>>> to read as humans?
>>>>
>>>
>>> Yeah, A-B-C-D terminology would be nice for the examples. It's fine if
>>> the actual implementation uses UUIDs.
>>>
>>>>>
>>>>>     {
>>>>>         "arguments": {
>>>>>             "actions": [
>>>>>                 {
>>>>>                     "type": "block-dirty-bitmap-add",
>>>>>                     "data": {
>>>>>                         "node": "drive-scsi0-0-0-0",
>>>>>                         "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>>                         "persistent": true
>>>>>                     }
>>>>>                 },
>>>>>                 {
>>>>>                     "type": "x-vz-block-dirty-bitmap-merge",
>>>>>                     "data": {
>>>>>                         "node": "drive-scsi0-0-0-0",
>>>>>                         "src_name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8",
>>>>>                         "dst_name": "libvirt-d068765e-8b50-4d74-9b72-1e55c663cbf8"
>>>>>                     }
>>>>>                 },
>>>>>                 {
>>>>>                     "type": "x-vz-block-dirty-bitmap-merge",
>>>>>                     "data": {
>>>>>                         "node": "drive-scsi0-0-0-0",
>>>>>                         "src_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17",
>>>>>                         "dst_name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^d068765e-8b50-4d74-9b72-1e55c663cbf8"
>>>>>                     }
>>>>>                 }
>>>>>             ]
>>>>>         },
>>>>>         "execute": "transaction"
>>>>>     }
>>>>>     {
>>>>>         "execute": "x-vz-block-dirty-bitmap-remove",
>>>>>         "arguments": {
>>>>>             "node": "drive-scsi0-0-0-0",
>>>>>             "name": "libvirt-8fc02db3-166f-4de7-b7aa-1f7303e6162b^0044757e-1a2d-4c2c-b92f-bb403309bb17"
>>>>>         }
>>>>>     }
>>>>>     {
>>>>>         "execute": "x-vz-block-dirty-bitmap-remove",
>>>>>         "arguments": {
>>>>>             "node": "drive-scsi0-0-0-0",
>>>>>             "name": "libvirt-0044757e-1a2d-4c2c-b92f-bb403309bb17^d068765e-8b50-4d74-9b72-1e55c663cbf8"
>>>>>         }
>>>>>     }
>>>>>
>>>>> Here is a list of bitmap commands used in the implementation but not yet upstream (AFAIK).
>>>>>
>>>>> x-vz-block-dirty-bitmap-remove
>>>
>>> We already have this, right? It doesn't even need to be transactionable.
>>>
>>>>> x-vz-block-dirty-bitmap-merge
>>>
>>> You need this...
>>>
>>>>> x-vz-block-dirty-bitmap-disable
>>>
>>> And this we had originally but since removed, but can be re-added trivially.
>>>
>>>>> x-vz-block-dirty-bitmap-enable (not in the examples; used when removing most recent checkpoint)
>>>>> x-vz-nbd-server-add-bitmap
>>>>>
>>>
>>> Do my comments make sense? Am I understanding you right so far? I'll try
>>> to offer a competing writeup to make sure we're on the same page with
>>> your proposed design before I waste any time trying to critique it -- in
>>> case I'm misunderstanding you.
>>
>> Yes, looks like we are in tune.
>>
> 
> More or less. Thank you for taking the time to explain it all out to me.
> I think I understand the general shape of your proposal, more or less.
> 
>>>
>>> Thank you for leading the charge and proposing new APIs for this
>>> feature. It will be very nice to expose the incremental backup
>>> functionality we've been working on in QEMU to users of libvirt.
>>>
>>> --js
>>
>> There are also patches ( if the API design survives the review phase at least partially :) )
>>
> 
> I can only really help (or hinder?) where QEMU primitives are concerned
> -- the actual libvirt API is going to be what Eric cares about.
> 
> I think this looks good so far, though -- at least, it makes sense to me.
> 
>>>
>>>>> *Restore operation nuances*
>>>>>
>>>>> As was written above, to restore a domain one needs to start it in the paused
>>>>> state, export the domain's disks and write them from the backup. However, qemu currently does
>>>>> not allow exporting disks for write even for a domain that never starts the guest
>>>>> CPUs. We have an experimental qemu command line option -x-vz-nbd-restore (passed together
>>>>> with the -incoming option) to fix this.
>>>>
>>>> Why can't restore be done while the guest is offline?  (Oh right, we
>>>> still haven't added decent qemu-img support for bitmap manipulation, so
>>>> we need a qemu process around for any bitmap changes).
>>>>
>>>
>>> I'm working on this right now, actually!
>>>
>>> I'm working on JSON format output for bitmap querying, and simple
>>> clear/delete commands. I hope to send this out very soon.
>>>
>>>> As I understand it, the point of bitmaps and snapshots is to create an
>>>> NBD server that a third-party can use to read just the dirty portions of
>>>> a disk in relation to a known checkpoint, to save off data in whatever
>>>> form it wants; so you are right that the third party then needs a way to
>>>> rewrite data from whatever internal form it stored it in back to the
>>>> view that qemu can consume when rolling back to a given backup, prior to
>>>> starting the guest on the restored data.  Do you need additional libvirt
>>>> APIs exposed for this, or do the proposed APIs for adding snapshots
>>>> cover everything already with just an additional flag parameter that
>>>> says whether the <domainblocksnapshot> is readonly (the third-party is
>>>> using it for collecting the incremental backup data) or writable (the
>>>> third-party is actively writing its backup into the file, and when it is
>>>> done, then perform a block-commit to merge that data back onto the main
>>>> qcow2 file)?
>>>>
> 
> 
> Thank you!
> 



