[libvirt] RFC: API additions for enhanced snapshot support

Tue Jun 21 13:12:28 UTC 2011

On Tue, Jun 21, 2011 at 11:30 AM, Daniel P. Berrange
<berrange at redhat.com> wrote:
> On Wed, Jun 15, 2011 at 11:41:27PM -0600, Eric Blake wrote:
>> Right now, libvirt has a snapshot API via virDomainSnapshotCreateXML,
>> but for qemu domains, it only works if all the guest disk images are
>> qcow2, and qemu rather than libvirt does all the work.  However, it has
>> a couple of drawbacks: it is inherently tied to domains (there is no way
>> to manage snapshots of storage volumes not tied to a domain, even though
>> libvirt does that for qcow2 images associated with offline qemu domains
>> by using the qemu-img application).  And it necessarily operates on all
>> of the images associated with a domain in parallel - if any disk image
>> is not qcow2, the snapshot fails, and there is no way to select a subset
>> of disks to save.  However, it works on both active (disk and memory
>> state) and inactive domains (just disk state).
>>
>> Upstream qemu is developing a 'live snapshot' feature, which allows the
>> creation of a snapshot without the current downtime of several seconds
>> required by the current 'savevm' monitor command, as well as a means for
>> controlling applications (libvirt) to request that qemu pause I/O to a
>> particular disk, then externally perform a snapshot, then tell qemu to
>> resume I/O (perhaps on a different file name or fd from the host, but
>> with no change to the contents seen by the guest).  Eventually, these
>> changes will make it possible for libvirt to create fast snapshots of
>> LVM partitions or btrfs files for guest disk images, as well as to
>
> Actually, IIUC, the QEMU 'live snapshot' feature is only for special
> disk formats like qcow2, qed, etc.

Yes.  The live snapshot feature in QEMU will not do btrfs, LVM, or
SAN/NAS snapshots.

> For formats like LVM, btrfs, SCSI, etc., libvirt will have to do all
> the work of creating the snapshot, possibly then telling QEMU to
> switch the backing file of a virtual disk to the new image (if the
> snapshot mechanism works that way).

Putting non-virtualization storage management code into libvirt seems
suboptimal since other applications may also want to use these generic
features.  However, I'm not aware of a storage management API for
Linux that would support LVM and various SAN/NAS appliances.  Ideally
we would have something like that, and libvirt could use the storage
management API without knowing about all the different storage types.

A service like udisks with plugins for SAN/NAS appliances could solve
the problem of where to put the storage management code.

>> select which disks are saved in a snapshot (that is, save a
>> crash-consistent state of a subset of disks, without the corresponding
>> RAM state, rather than making a full system restore point); the latter
>> would work best with guest cooperation to quiesce disks before qemu
>> pauses I/O to that disk, but that is an orthogonal enhancement.
>
> At the very least, you need a way to stop QEMU writing to the disk
> for a period of time, whether or not the guest is quiesced. There
> are basically 3 options:
>
>  1. Pause the guest CPUs (eg  'stop' on the monitor)
>  2. QEMU queues I/O from guest in memory temporarily (does not currently exist)
>  3. QEMU tells guest to quiesce I/O temporarily (does not currently exist)
>
> To perform a snapshot, libvirt would need to do:
>
>  1. Stop I/O using one of the 3 methods above
>  2. If disk is a special format
>      - Ask QEMU to snapshot it
>    Else
>      - Create snapshot ourselves
>      - Update QEMU disk backing path (optional)
>  3. Resume I/O

Yes, QEMU needs to provide commands for these individual steps.  Also,
the guest must be notified of the snapshot operation so that it can
flush in-memory data to disk.  Otherwise this cannot be used for
backup purposes, since guests with several GBs of RAM will keep a
considerable portion of state in memory and the disk will be out of date.
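
To make the ordering concrete, here is a rough sketch of the
stop/snapshot/resume sequence from the list above.  Every name in it
(snap_ops, snapshot_disk, the trace helpers) is made up for
illustration; none of them are libvirt or QEMU APIs, and the guest
flush step is left out.  The one design point it tries to show is that
I/O must be resumed on every path once it has been stopped.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical hooks: none of these names exist in libvirt or QEMU. */
struct snap_ops {
    int (*stop_io)(void *ctx);              /* any of the 3 pause options */
    int (*qemu_snapshot)(void *ctx);        /* format QEMU snapshots itself */
    int (*external_snapshot)(void *ctx);    /* LVM, btrfs, SAN/NAS: caller's job */
    int (*update_backing_path)(void *ctx);  /* optional, may be NULL */
    int (*resume_io)(void *ctx);
};

/* Returns 0 on success, -1 on failure.  Once I/O has been stopped it is
 * resumed on every path, so a failed snapshot does not leave the guest
 * stalled. */
int snapshot_disk(const struct snap_ops *ops, void *ctx, bool qemu_format)
{
    int ret;

    assert(ops && ops->stop_io && ops->resume_io);

    if (ops->stop_io(ctx) < 0)
        return -1;              /* nothing was stopped, nothing to resume */

    if (qemu_format) {
        ret = ops->qemu_snapshot(ctx);
    } else {
        ret = ops->external_snapshot(ctx);
        if (ret == 0 && ops->update_backing_path)
            ret = ops->update_backing_path(ctx);
    }

    if (ops->resume_io(ctx) < 0)
        ret = -1;
    return ret < 0 ? -1 : 0;
}

/* Trace-recording stand-ins so the ordering can be checked without a
 * hypervisor.  Each hook appends one letter to the log. */
struct trace {
    char log[16];
    bool fail_snapshot;
};

static int note(void *ctx, char step, bool fail)
{
    struct trace *t = ctx;
    size_t n = strlen(t->log);

    if (n + 1 < sizeof(t->log)) {
        t->log[n] = step;
        t->log[n + 1] = '\0';
    }
    return fail ? -1 : 0;
}

static int t_stop(void *c)   { return note(c, 'S', false); }
static int t_qemu(void *c)   { return note(c, 'Q', ((struct trace *)c)->fail_snapshot); }
static int t_ext(void *c)    { return note(c, 'E', ((struct trace *)c)->fail_snapshot); }
static int t_path(void *c)   { return note(c, 'P', false); }
static int t_resume(void *c) { return note(c, 'R', false); }

const struct snap_ops trace_ops = { t_stop, t_qemu, t_ext, t_path, t_resume };
```

With the trace hooks, a disk in a QEMU-managed format goes through
stop, QEMU snapshot, resume, while an external one goes through stop,
external snapshot, path update, resume.  A failing external snapshot
skips the path update but still resumes.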

>> /* Save a domain into the file 'to' with additional actions.  If flags
>> is 0, then xml is ignored, and this is like virDomainSave.  If flags
>> includes VIR_DOMAIN_SAVE_DISKS, then all of the associated disk images
>> are also snapshotted, as if by virStorageVolSnapshotCreateXML; the xml
>> argument is optional, but if present, it should be a <domainsnapshot>
>> element with <disk> sub-elements for directions on each disk that needs
>> a non-empty xml argument for proper volume snapshot creation.  If flags
>> includes VIR_DOMAIN_SAVE_RESUME, then the guest is resumed after the
>> offline snapshot is complete (note that VIR_DOMAIN_SAVE_RESUME without
>> VIR_DOMAIN_SAVE_DISKS makes little sense, as a saved state file is
>> rendered useless if the disk images are modified before it is resumed).
>>  If flags includes VIR_DOMAIN_SAVE_QUIESCE, this requests that a guest
>> agent quiesce disk state before the saved state file is created.  */
>> int virDomainSaveFlags(virDomainPtr domain, const char *to, const char
>> *xml, unsigned int flags);
>
>
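
As an aside on the flag semantics quoted above, a small sketch of how a
caller might sanity-check a flag combination before calling the
proposed virDomainSaveFlags.  The flag names come from the proposal,
but the bit values and the helper save_flags_sensible are placeholders
of mine.  The proposal only says that RESUME without DISKS "makes
little sense", so treating it as a rejection here is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Names as in the proposal; the bit values are placeholders. */
enum {
    VIR_DOMAIN_SAVE_DISKS   = 1 << 0,
    VIR_DOMAIN_SAVE_RESUME  = 1 << 1,
    VIR_DOMAIN_SAVE_QUIESCE = 1 << 2,
};

#define SAVE_KNOWN_FLAGS \
    (VIR_DOMAIN_SAVE_DISKS | VIR_DOMAIN_SAVE_RESUME | VIR_DOMAIN_SAVE_QUIESCE)

/* Hypothetical helper.  Returns true if the combination looks usable;
 * otherwise returns false and points *why at a short explanation. */
bool save_flags_sensible(unsigned int flags, const char **why)
{
    assert(why);
    *why = NULL;

    if (flags & ~(unsigned int)SAVE_KNOWN_FLAGS) {
        *why = "unknown flag bits";
        return false;
    }

    /* Resuming the guest without snapshotting its disks lets the disks
     * drift away from the saved memory state. */
    if ((flags & VIR_DOMAIN_SAVE_RESUME) && !(flags & VIR_DOMAIN_SAVE_DISKS)) {
        *why = "RESUME without DISKS makes the saved state useless";
        return false;
    }

    return true;    /* flags == 0 behaves like plain virDomainSave */
}
```
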
> What I'm not seeing here is how these APIs all relate to the existing
> support we have in virStorageVol APIs for creating snapshots. This is
> already implemented for LVM, QCow, QCow2. The snapshots are created by
> specifying a backing file in the initial volume description. Depending
> on the storage type, the backing file for a snapshot can be writable,
> or readonly. Snapshots appear as just more storage volumes, and are not
> restricted to being within the same pool as the original volume. You can
> also mix storage formats, eg, create a Qcow2 volume with backing file
> on LVM, which is itself a snapshot of another LVM volume.
>
> The QCow2 internal snapshots don't really fit into our existing model,
> since they don't have extra associated external files, so maybe we do
> still want some of these explicit APIs to query snapshots against
> volumes.

There will be restrictions based on the storage type you choose.  For
example, LVM volumes can only be backed by other LVM volumes AFAIK.
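
A minimal sketch of that kind of restriction, treating snapshots as
volumes with a backing pointer.  The types and functions are invented
for illustration, and the rule table encodes only what has been said in
this thread (LVM on LVM, qcow2 on top of LVM).  Letting qcow2 sit on
any format is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a snapshot is just a volume with a backing volume. */
enum vol_format { VOL_LVM, VOL_QCOW2, VOL_RAW };

struct vol {
    const char *name;
    enum vol_format format;
    const struct vol *backing;   /* NULL for a base volume */
};

/* Which backing formats a given format may sit on top of. */
bool backing_allowed(enum vol_format child, enum vol_format backing)
{
    switch (child) {
    case VOL_LVM:
        return backing == VOL_LVM;   /* LVM snapshots only of LVM (AFAIK) */
    case VOL_QCOW2:
        return true;                 /* assumption: any backing format */
    case VOL_RAW:
    default:
        (void)backing;
        return false;                /* raw images have no backing file */
    }
}

/* Walk the chain from a volume down to its base.  Returns the number of
 * backing links, or -1 if any link breaks the rules above. */
int chain_depth(const struct vol *v)
{
    int depth = 0;

    assert(v);
    for (; v->backing; v = v->backing) {
        if (!backing_allowed(v->format, v->backing->format))
            return -1;
        depth++;
    }
    return depth;
}
```

The mixed chain Daniel describes (a qcow2 volume backed by an LVM
snapshot of another LVM volume) passes this check with two links, while
an LVM volume backed by a qcow2 file is rejected.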

It would be nice to expose snapshots as volumes and allow them to be
accessed using virStorageVolDownload() (or an equivalent API).

One snapshot use case is a backup solution that wants to integrate
virtualization support.  They would need to talk to libvirt to take
regular snapshots and copy out the data before deleting the snapshot.
In addition, they require a dirty block tracking API for incremental
backups where they avoid copying out the entire disk contents by only
copying the disk blocks that have changed.  Implementing dirty block
tracking for image formats in QEMU is possible and has been discussed
a little already.  btrfs also supports dirty block tracking between
transaction IDs.  Some SAN/NAS appliances may also expose this
information.

Any thoughts on a dirty block tracking API that produces an extent
list of dirty blocks between a snapshot and another snapshot/volume?
I think virStorageVolDownload() could be used to copy out only the
dirty blocks.
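
To give the question something concrete to shoot at, here is one
possible shape for such an extent list.  Nothing here exists in libvirt
today: struct dirty_extent, copy_dirty_extents and the callback type
are hypothetical.  The callback takes an offset and a length because
that is the pair virStorageVolDownload() already accepts, so each
extent could map onto one ranged download.  The sketch assumes the
extents arrive sorted by offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one run of bytes that changed between two snapshots. */
struct dirty_extent {
    uint64_t offset;    /* byte offset into the volume */
    uint64_t length;    /* length in bytes */
};

/* Copies one byte range; returns 0 on success, -1 on failure. */
typedef int (*copy_range_fn)(uint64_t offset, uint64_t length, void *opaque);

/* Copy only the dirty ranges.  Adjacent or overlapping extents are
 * merged so each contiguous run is copied once.  Assumes the extents
 * are sorted by offset.  Returns 0 on success, -1 if a copy fails;
 * *copied is set to the number of bytes handed to the callback. */
int copy_dirty_extents(const struct dirty_extent *ext, size_t n,
                       copy_range_fn copy, void *opaque, uint64_t *copied)
{
    size_t i = 0;

    assert(copy && copied);
    *copied = 0;

    while (i < n) {
        uint64_t start = ext[i].offset;
        uint64_t end = start + ext[i].length;

        i++;
        while (i < n && ext[i].offset <= end) {
            uint64_t e = ext[i].offset + ext[i].length;

            if (e > end)
                end = e;
            i++;
        }

        if (end == start)
            continue;           /* skip empty extents */
        if (copy(start, end - start, opaque) < 0)
            return -1;
        *copied += end - start;
    }
    return 0;
}

/* Stand-in callback that only counts what it is asked to copy. */
struct copy_stats {
    unsigned calls;
    uint64_t last_offset;
};

int count_copy(uint64_t offset, uint64_t length, void *opaque)
{
    struct copy_stats *st = opaque;

    (void)length;
    st->calls++;
    st->last_offset = offset;
    return 0;
}
```

For a list like {0, 4096}, {4096, 4096}, {1048576, 512} this issues two
copies rather than three, and an incremental backup only moves the
bytes covered by the extents instead of the whole disk.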

Stefan