[libvirt] RFC: API additions for enhanced snapshot support

Tue Jun 21 10:30:40 UTC 2011

On Wed, Jun 15, 2011 at 11:41:27PM -0600, Eric Blake wrote:
> Right now, libvirt has a snapshot API via virDomainSnapshotCreateXML,
> but for qemu domains, it only works if all the guest disk images are
> qcow2, and qemu rather than libvirt does all the work.  However, it has
> a couple of drawbacks: it is inherently tied to domains (there is no way
> to manage snapshots of storage volumes not tied to a domain, even though
> libvirt does that for qcow2 images associated with offline qemu domains
> by using the qemu-img application).  And it necessarily operates on all
> of the images associated with a domain in parallel - if any disk image
> is not qcow2, the snapshot fails, and there is no way to select a subset
> of disks to save.  However, it works on both active (disk and memory
> state) and inactive domains (just disk state).
> 
> Upstream qemu is developing a 'live snapshot' feature, which allows the
> creation of a snapshot without the current downtime of several seconds
> required by the current 'savevm' monitor command, as well as means for
> controlling applications (libvirt) to request that qemu pause I/O to a
> particular disk, then externally perform a snapshot, then tell qemu to
> resume I/O (perhaps on a different file name or fd from the host, but
> with no change to the contents seen by the guest).  Eventually, these
> changes will make it possible for libvirt to create fast snapshots of
> LVM partitions or btrfs files for guest disk images, as well as to

Actually, IIUC, the QEMU 'live snapshot' feature is only for special
disk formats like qcow2, qed, etc.

For formats like LVM, btrfs, SCSI, etc., libvirt will have to do all
the work of creating the snapshot, possibly then telling QEMU to
switch the backing file of a virtual disk to the new image (if the
snapshot mechanism works that way).

> select which disks are saved in a snapshot (that is, save a
> crash-consistent state of a subset of disks, without the corresponding
> RAM state, rather than making a full system restore point); the latter
> would work best with guest cooperation to quiesce disks before qemu
> pauses I/O to that disk, but that is an orthogonal enhancement.

At the very least, you need a way to stop QEMU writing to the disk
for a period of time, whether or not the guest is quiesced. There
are basically 3 options:

 1. Pause the guest CPUs (e.g. 'stop' on the monitor)
 2. QEMU queues I/O from guest in memory temporarily (does not currently exist)
 3. QEMU tells guest to quiesce I/O temporarily (does not currently exist)

To perform a snapshot, libvirt would need to do:

 1. Stop I/O using one of the 3 methods above
 2. If disk is a special format
      - Ask QEMU to snapshot it
    Else
      - Create snapshot ourselves
      - Update QEMU disk backing path (optional)
 3. Resume I/O
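
As a rough sketch of that sequence in C: the snapshot_ops hooks,
snapshot_disk() and the trace_* backend below are all hypothetical names
made up for illustration, not libvirt or QEMU APIs. The point is only
the ordering, and that I/O gets resumed even when the snapshot step fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hooks for illustration; none of these are libvirt or QEMU
 * APIs. Each hook returns 0 on success and -1 on failure. */
typedef struct {
    int (*stop_io)(void *opaque);                               /* step 1 */
    int (*qemu_snapshot)(void *opaque, const char *disk);       /* step 2, special formats */
    int (*external_snapshot)(void *opaque, const char *disk);   /* step 2, everything else */
    int (*update_backing_path)(void *opaque, const char *disk); /* optional, may be NULL */
    int (*resume_io)(void *opaque);                             /* step 3 */
} snapshot_ops;

int snapshot_disk(const snapshot_ops *ops, void *opaque,
                  const char *disk, bool special_format)
{
    int ret = -1;

    assert(ops != NULL && disk != NULL);

    /* Step 1: if I/O could not be stopped there is nothing to resume. */
    if (ops->stop_io(opaque) < 0)
        return -1;

    /* Step 2: QEMU snapshots special formats; otherwise we do it ourselves. */
    if (special_format) {
        if (ops->qemu_snapshot(opaque, disk) < 0)
            goto resume;
    } else {
        if (ops->external_snapshot(opaque, disk) < 0)
            goto resume;
        if (ops->update_backing_path &&
            ops->update_backing_path(opaque, disk) < 0)
            goto resume;
    }
    ret = 0;

resume:
    /* Step 3: always resume I/O, even if the snapshot itself failed. */
    if (ops->resume_io(opaque) < 0)
        ret = -1;
    return ret;
}

/* Trace backend: appends each step name to a caller-supplied, zero-initialised
 * buffer of at least 64 bytes, so the ordering can be inspected. */
static int trace_step(void *opaque, const char *step)
{
    strcat((char *)opaque, step);
    return 0;
}
static int trace_stop(void *o) { return trace_step(o, "stop,"); }
static int trace_qemu(void *o, const char *d) { (void)d; return trace_step(o, "qemu-snapshot,"); }
static int trace_external(void *o, const char *d) { (void)d; return trace_step(o, "external-snapshot,"); }
static int trace_update(void *o, const char *d) { (void)d; return trace_step(o, "update-path,"); }
static int trace_resume(void *o) { return trace_step(o, "resume,"); }

const snapshot_ops trace_ops = {
    .stop_io = trace_stop,
    .qemu_snapshot = trace_qemu,
    .external_snapshot = trace_external,
    .update_backing_path = trace_update,
    .resume_io = trace_resume,
};
```

Calling snapshot_disk(&trace_ops, buf, "vda", true) with an empty buffer
records the stop, QEMU snapshot and resume steps in that order; passing
false takes the external snapshot branch instead.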

> However, my first goal with API enhancements is to merely prove that
> libvirt can manage a live snapshot by using qemu-img on a qcow2 image
> rather than the current 'savevm' approach of qemu doing all the work.

FYI, QEMU developers are adamant that if the disk image is open
by QEMU you should, in general, not do anything using qemu-img
on that disk image. libvirt does currently do things like querying
disk capacity, but we can get away with that because it is an
invariant section of the header. We certainly can't create internal
snapshots with qemu-img while the guest is live. Creating external
snapshots with qemu-img is probably OK, but when I've suggested
this before, QEMU developers were unhappy with even that.
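
To make that rule of thumb concrete, here is a small C sketch of the
policy as I have described it. The img_op / img_verdict enums and
qemu_img_verdict() are hypothetical names for illustration only, and
treating every operation as safe when QEMU does not have the image open
is my assumption, based on libvirt already using qemu-img for offline
domains.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of qemu-img usage; not a libvirt or QEMU API. */
typedef enum {
    IMG_OP_QUERY_CAPACITY,    /* read the invariant part of the header */
    IMG_OP_INTERNAL_SNAPSHOT, /* e.g. 'qemu-img snapshot -c' */
    IMG_OP_EXTERNAL_SNAPSHOT  /* new file with the image as backing file */
} img_op;

typedef enum {
    IMG_SAFE,     /* fine to run */
    IMG_DISPUTED, /* probably OK, but QEMU developers object */
    IMG_UNSAFE    /* must not be run */
} img_verdict;

img_verdict qemu_img_verdict(img_op op, bool image_open_by_qemu)
{
    /* Assumption: with no running QEMU holding the image, anything goes. */
    if (!image_open_by_qemu)
        return IMG_SAFE;

    switch (op) {
    case IMG_OP_QUERY_CAPACITY:
        return IMG_SAFE;
    case IMG_OP_INTERNAL_SNAPSHOT:
        return IMG_UNSAFE;
    case IMG_OP_EXTERNAL_SNAPSHOT:
        return IMG_DISPUTED;
    }

    /* Unknown operation: flag it in debug builds, refuse it otherwise. */
    assert(!"unhandled img_op");
    return IMG_UNSAFE;
}
```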

> Additionally, libvirt provides the virDomainSave command, which saves
> just the state of the domain's memory, and stops the guest.  A crude
> libvirt-only snapshot could thus already be done by using virDomainSave,
> then externally doing a snapshot of all disk images associated with the
> domain by using virStorageVol APIs, except that such APIs don't yet
> exist.  Additionally, virDomainSave has no flags argument, so there is
> no way to request that the guest be resumed after the snapshot completes.
> 
> Right now, I'm proposing the addition of virDomainSaveFlags, along with
> a series of virStorageVolSnapshot* APIs that mirror the
> virDomainSnapshot* APIs.  This would mean adding:
> 
> 
> /* Opaque type to manage a snapshot of a single storage volume.  */
> typedef virStorageVolSnapshotPtr;
> 
> /* Create a snapshot of a storage volume.  XML is optional, if non-NULL,
> it would be a new top-level element <volsnapshot> which is similar to
> the top-level <domainsnapshot> for virDomainSnapshotCreateXML, to
> specify name and description. Flags is 0 for now. */
> virStorageVolSnapshotPtr virStorageVolSnapshotCreateXML(virStorageVolPtr
> vol, const char *xml, unsigned int flags);
> [For qcow2, this would be implemented with 'qemu-img snapshot -c',
> similar to what virDomainSnapshotXML already does on inactive domains.
> Later, we can add LVM and btrfs support, or even allow full file copies
> of any file type.  Also in the future, we could enhance XML to take a
> new element that describes a relationship between the name of the
> original and of the snapshot, in the case where a new filename has to be
> created to complete the snapshot process.]
> 
> 
> /* Probe if vol has snapshots.  1 if true, 0 if false, -1 on error.
> Flags is 0 for now.  */
> int virStorageVolHasCurrentSnapshot(virStorageVolPtr vol, unsigned int
> flags);
> [For qcow2 images, snapshots can be contained within the same file and
> managed with qemu-img -l, but for other formats, this may mean that
> libvirt has to start managing externally saved data associated with the
> storage pool that associates snapshots with filenames.  In fact, even
> for qcow2 it might be useful to support creation of new files backed by
> the previous snapshot rather than cramming multiple snapshots in one
> file, so we may have a use for flags to filter out the presence of
> single-file vs. multiple-file snapshot setups.]
> 
> 
> /* Revert a volume back to the state of a snapshot, returning 0 on
> success.  Flags is 0 for now.  */
> int virStorageVolRevertToSnapshot(virStorageVolSnapshotPtr snapshot,
> unsigned int flags);
> [For qcow2, this would involve qemu-img snapshot -a.  Here, a useful
> flag might be whether to delete any changes made after the point of the
> snapshot; virDomainRevertToSnapshot should probably honor the same type
> of flag.]
> 
> 
> /* Return the most recent snapshot of a volume, if one exists, or NULL
> on failure.  Flags is 0 for now.  */
> virStorageVolSnapshotPtr virStorageVolSnapshotCurrent(virStorageVolPtr
> vol, unsigned int flags);
> 
> 
> /* Delete the storage associated with a snapshot (although the opaque
> snapshot object must still be independently freed).  If flags is 0, any
> child snapshots based off of this one are rebased onto the parent; if
> flags is VIR_STORAGE_VOL_SNAPSHOT_DELETE_CHILDREN , then any child
> snapshots based off of this one are also deleted.  */
> int virStorageVolSnapshotDelete(virStorageVolSnapshotPtr snapshot,
> unsigned int flags);
> [For qcow2, this would involve qemu-img snapshot -d.  For
> multiple-file snapshots, this would also involve qemu-img commit.]
> 
> 
> /* Free the object returned by
> virStorageVolSnapshot{Current,CreateXML,LookupByName}.  The storage
> snapshot associated with this object still exists, if it has not been
> deleted by virStorageVolSnapshotDelete.  */
> int virStorageVolSnapshotFree(virStorageVolSnapshotPtr snapshot);
> 
> 
> /* Return the <volsnapshot> XML details about this snapshot object.
> Flags is 0 for now.  */
> int virStorageVolSnapshotGetXMLDesc(virStorageVolSnapshotPtr snapshot,
> unsigned int flags);
> 
> 
> /* Return the names of all snapshots associated with this volume, using
> len from virStorageVolSnapshotLen.  Flags is 0 for now.  */
> int virStorageVolSnapshotListNames(virStorageVolPtr vol, char **names,
> int nameslen, unsigned int flags);
> [For qcow2, this involves qemu-img -l.  Additionally, if
> virStorageVolHasCurrentSnapshot learns to filter on in-file vs.
> multi-file snapshots, then the same flags would apply here.]
> 
> 
> /* Get the opaque object tied to a snapshot name.  Flags is 0 for now.  */
> virStorageVolSnapshotPtr
> virStorageVolSnapshotLookupByName(virStorageVolPtr vol, const char
> *name, unsigned int flags);
> 
> 
> /* Determine how many snapshots are tied to a volume, or -1 on error.
> Flags is 0 for now.  */
> int virStorageVolSnapshotNum(virStorageVolPtr vol, unsigned int flags);
> [Same flags as for virStorageVolSnapshotListNames.]
> 
> 
> /* Save a domain into the file 'to' with additional actions.  If flags
> is 0, then xml is ignored, and this is like virDomainSave.  If flags
> includes VIR_DOMAIN_SAVE_DISKS, then all of the associated disk images
> are also snapshotted, as if by virStorageVolSnapshotCreateXML; the xml
> argument is optional, but if present, it should be a <domainsnapshot>
> element with <disk> sub-elements for directions on each disk that needs
> a non-empty xml argument for proper volume snapshot creation.  If flags
> includes VIR_DOMAIN_SAVE_RESUME, then the guest is resumed after the
> offline snapshot is complete (note that VIR_DOMAIN_SAVE_RESUME without
> VIR_DOMAIN_SAVE_DISKS makes little sense, as a saved state file is
> rendered useless if the disk images are modified before it is resumed).
>  If flags includes VIR_DOMAIN_SAVE_QUIESCE, this requests that a guest
> agent quiesce disk state before the saved state file is created.  */
> int virDomainSaveFlags(virDomainPtr domain, const char *to, const char
> *xml, unsigned int flags);

What I'm not seeing here is how these APIs all relate to the existing
support we have in virStorageVol APIs for creating snapshots. This is
already implemented for LVM, QCow, QCow2. The snapshots are created by
specifying a backing file in the initial volume description. Depending
on the storage type, the backing file for a snapshot can be writable,
or readonly. Snapshots appear as just more storage volumes, and are not
restricted to being within the same pool as the original volume. You can
also mix storage formats, e.g. create a QCow2 volume with a backing file
on LVM, which is itself a snapshot of another LVM volume.
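
For illustration, a minimal C sketch that builds the kind of volume
description I mean, with a <backingStore> pointing at the original
volume. format_snapshot_volume_xml() is a hypothetical helper, the
names and paths in any call are made up, and it does no XML escaping,
so treat it as a sketch rather than something to copy into libvirt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes a storage volume description whose backing
 * store is an existing volume, which is how snapshots are expressed in the
 * storage APIs. Returns the number of characters written, or -1 if the
 * buffer is too small. No XML escaping is done on the arguments. */
int format_snapshot_volume_xml(char *buf, size_t buflen,
                               const char *name,
                               unsigned long long capacity_bytes,
                               const char *format,
                               const char *backing_path,
                               const char *backing_format)
{
    int n;

    assert(buf != NULL && buflen > 0);

    n = snprintf(buf, buflen,
                 "<volume>\n"
                 "  <name>%s</name>\n"
                 "  <capacity>%llu</capacity>\n"
                 "  <target>\n"
                 "    <format type='%s'/>\n"
                 "  </target>\n"
                 "  <backingStore>\n"
                 "    <path>%s</path>\n"
                 "    <format type='%s'/>\n"
                 "  </backingStore>\n"
                 "</volume>\n",
                 name, capacity_bytes, format, backing_path, backing_format);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    if (n < 0 || (size_t)n >= buflen)
        return -1;
    return n;
}
```

The resulting string is the sort of document one would pass to
virStorageVolCreateXML() on the pool that should hold the new volume,
for example a qcow2 file whose backing path is an LVM device with
backing format 'raw' (that pairing is an assumption on my part).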

The QCow2 internal snapshots don't really fit into our existing model,
since they don't have extra associated external files, so maybe we do
still want some of these explicit APIs to query snapshots against
volumes.

> 
> 
> Also, the existing virDomainSnapshotCreateXML can be made more powerful
> by adding new flags and enhancing the existing XML for <domainsnapshot>.
>  When flags is 0, the current behavior of saving memory state alongside
> all disks (for running domains, via savevm) or just snapshotting all
> disks with default settings (for offline domains, via qemu-img) is kept.
>  If flags includes VIR_DOMAIN_SNAPSHOT_LIVE, then the guest must be
> running, and the new monitor commands for live snapshots are used.  If
> flags includes VIR_DOMAIN_SNAPSHOT_DISKS_ONLY, then only the disks are
> snapshotted (on a running guest, this generally means they will only be
> crash-consistent, and will need an fsck before that disk state can be
> remounted), but it will shave off time by not saving memory.  If flags
> includes VIR_DOMAIN_SNAPSHOT_QUIESCE, then this will additionally
> request that a guest agent quiesce disk state before the live snapshot
> is taken (increasing the likelihood of a stable disk, rather than a
> crash-consistent disk; but it requires cooperation from the guest so it
> is no more reliable than memballoon changes).
> 
> As for the XML changes, it makes sense to snapshot just a subset of
> disks when you only care about crash-consistent state or if you can rely
> on a guest agent to quiesce the subset of disk(s) you care about, so the
> existing <domainsnapshot> element needs a new optional subelement to
> control which disks are snapshotted; additionally, this subelement will
> be useful for disk image formats that require additional complexity
> (such as a secondary file name, rather than the inline snapshot feature
> of qcow2).  I'm envisioning something like the following:
> 
> <domainsnapshot>
>   <name>whatever</name>
>   <disk name='/path/to/image1' snapshot='no'/>
>   <disk name='/path/to/image2'>
>     <volsnapshot>...</volsnapshot>
>   </disk>
> </domainsnapshot>
> 
> where there can be up to as many <disk> elements as there are disk
> <devices> in the domain xml; any domain disk not listed is given default
> treatment.  The name attribute of <disk> is mandatory, in order to match
> this disk element to one of the domain disks.  The snapshot='yes|no'
> attribute is optional, defaulting to yes, in order to skip a particular
> disk.  The <volsnapshot> subelement is optional, but if present, it
> would be the same XML as is provided to the
> virStorageVolSnapshotCreateXML.  [And since my first phase of
> implementation will be focused on inline qcow2 snapshots, I don't yet
> know what that XML will need to contain for any other type of snapshots,
> such as mapping out how the snapshot backing file will be named in
> relation to the possibly new live file.]
> 
> Any feedback on this approach?  Any other APIs that would be useful to
> add?  I'd like to get all the new APIs in place for 0.9.3 with minimal
> qcow2 functionality, then use the time before 0.9.4 to further enhance
> the APIs to cover more snapshot cases but without having to add any new
> APIs.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|