[libvirt] RFCv2: virDomainSnapshotCreateXML enhancements

Eric Blake eblake at redhat.com
Thu Aug 11 13:23:03 UTC 2011


On 08/11/2011 04:00 AM, Kevin Wolf wrote:
> Am 11.08.2011 00:08, schrieb Eric Blake:
>> After re-reading the feedback received on those threads, I think I've
>> settled on a pretty robust design for my first round of adding
>> improvements to the management of snapshots tied to a domain, while
>> leaving the door open for future extensions.
>>
>> Sorry this email is so long (I've had it open in my editor for more than
>> 48 hours now as I keep improving it), but hopefully it is worth the
>> effort to read.  See the bottom if you want the shorter summary on the
>> proposed changes.
>
> It was definitely a good read, thanks for writing it up.

Thanks for taking the time to read it.

>
> Of course, I'm not really familiar with libvirt (now a bit more than
> before :-)), so all my comments are from a qemu developer perspective.
> Some of them may look like stupid questions or turn out to be
> misunderstandings, but I hope it's still helpful for you to see how qemu
> people understand things.

Your comments are indeed helpful.

>> internal disk snapshot: a disk snapshot where both the saved state and
>> delta reside in the same file (possible with qcow2 and qed).  If a disk
>> image is not in use by qemu, this is possible via 'qemu-img snapshot -c'.
>
> QED doesn't support internal snapshots.

Good to know.  That means that 'savevm' (checkpoint, internal, internal) 
is indeed a qcow2-only feature.  It also means that libvirt should 
reject attempts to mix snapshot='internal' and qed.
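
To make the rejected combination concrete, here is a sketch (file names
and targets are made up) of a qed-backed disk in the domain XML, paired
with a <domainsnapshot> request for an internal snapshot of that disk -
the pairing libvirt would need to diagnose as unsupported:

```xml
<!-- in the domain definition: a disk using the qed format -->
<disk type='file' device='disk'>
  <driver name='qemu' type='qed'/>
  <source file='/path/to/disk.qed'/>
  <target dev='vda' bus='virtio'/>
</disk>

<!-- in the <domainsnapshot> xml: an internal snapshot request,
     which qed cannot honor -->
<disk name='vda' snapshot='internal'/>
```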

>
>> external disk snapshot: a disk snapshot where the saved state is one
>> file, and the delta is tracked in another file.  For a disk image not in
>> use by qemu, this can be done with qemu-img to create a new qcow2 file
>> wrapping any type of existing file.  Recent qemu has also learned the
>> 'snapshot_blkdev' monitor command for creating external snapshots while
>> qemu is using a disk, and the goal of this RFC is to expose that
>> functionality from within existing libvirt APIs.
>>
>> saved state: all non-disk information used to resume a guest at the same
>> state, assuming the disks did not change.  With qemu, this is possible
>> via migration to a file.
>
> Is this terminology already used in libvirt? In qemu we tend to call it
> the VM state.

"VM state" is a bit nicer than "saved state", so I'll keep that in mind 
as I actually write the code and put comments in.  Right now, I believe 
'man virsh' favors saved state (after all, this is what you get with the 
'virsh save' and 'virsh managedsave' commands).

>> Libvirt currently has a bug in that it only saves<domain>/<uuid>  rather
>> than the full domain xml along with a checkpoint - if any devices are
>> hot-plugged (or in the case of offline snapshots, if the domain
>> configuration is changed) after a snapshot but before the revert, then
>> things will most likely blow up due to the differences in devices in use
>> by qemu vs. the devices expected by the snapshot.
>
> Offline snapshot means that it's only a disk snapshot, so I don't think
> there is any problem with changing the hardware configuration before
> restoring it.
>
> Or does libvirt try to provide something like offline checkpoints, where
> restoring would not only restore the disk but also roll back the libvirt
> configuration?

Reverting to an offline checkpoint _should_ be reverting the libvirt 
configuration back to match (so if you make an offline checkpoint, then 
add a drive, then revert to the checkpoint, the added drive should be 
gone).  The bug I was referring to is that libvirt does not currently 
track enough information to properly do the rollback; fixing that 
shortcoming is one of the goals of this RFC.

>
> How do you decide whether to use internal or external snapshots? Should
> this be another flag? In fact we have multiple dimensions:

There are indeed multiple dimensions, as well as two parameters which 
can affect the dimensions (an xml argument means that we can add 
arbitrary additional xml constructs to cram in as much additional 
tweaking as we can dream of for the command, and a flags argument is 
limited to at most 32 binary decisions).  I hope I'm making the right 
tradeoff in using flags vs. xml.

>
> * Disk snapshot or checkpoint? (you have a flag for this)

Correct - my rationale here is that the <state> subelement of the 
<domainsnapshot> is currently output-only (it is ignored on 
virDomainSnapshotCreateXML), but accurately tracks which way the flag 
was chosen (state will be one of offline|running|paused if a checkpoint, 
or disk-snapshot if a disk snapshot); making this choice via flag meant 
that I did not have to change the snapshot creation to parse the xml to 
make the decision.
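
For reference, the output-only <state> tracking looks something like 
this sketch (the name and creation time are made up):

```xml
<domainsnapshot>
  <name>snap1</name>
  <state>disk-snapshot</state>
  <creationTime>1313068983</creationTime>
  ...
</domainsnapshot>
```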

> * Disk snapshot stored internally or externally (missing)

XML does this - my rationale is that this is not a binary decision, but 
requires per-disk information for every disk in the domain.  Therefore, 
the solution must involve new xml elements; I added <disks> with 
multiple <disk> subelements to detail the decision for each disk.
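
As a sketch of what that xml might look like (disk targets and the 
external file name are made up, and a snapshot='no' spelling for 
skipping a disk is assumed), a three-disk domain could mix the styles 
per disk:

```xml
<domainsnapshot>
  <name>mixed</name>
  <disks>
    <disk name='vda' snapshot='internal'/>
    <disk name='vdb' snapshot='external'>
      <source file='/path/to/vdb-snap.qcow2'/>
    </disk>
    <disk name='vdc' snapshot='no'/>
  </disks>
</domainsnapshot>
```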

> * VM state stored internally or externally (missing)

Correct that this is missing for now.  For qemu, you can either store VM 
state externally (migrate to file, used by 'virsh save' and 'virsh 
managedsave') but have no disk snapshots, or you can store VM state 
internally (savevm, used by 'virsh snapshot-create'), but with no way to 
avoid disk snapshots.  It would indeed be possible for future extensions 
to make virDomainSnapshotCreateXML become a superset of virDomainSave at 
creating external VM state, or even to enhance virDomainSaveFlags (which 
also takes an xml argument) to create both the external VM state and the 
disk snapshots, but I'm leaving that for future extensions and focusing 
on snapshot_blkdev integration at present.

>
> qemu currently only supports (disk, ext), (disk, int), (checkpoint, int,
> int). But other combinations could be made possible in the future, and I
> think especially the combination (checkpoint, int, ext) could be
> interesting.

Indeed, and I think that we still have room to expand in this direction; 
just as <domainsnapshot> is now learning <disks> for whether each disk 
should be internal or external, we could also teach it a new subelement 
<vmstate> (or some nicer spelling) for controlling whether the VM state 
is internal (and if so, in which qcow2 image) or external.
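
A sketch of how such a subelement might look (the <vmstate> spelling and 
its attributes are hypothetical, not part of this RFC):

```xml
<domainsnapshot>
  <name>checkpoint-ext</name>
  <!-- hypothetical element: external VM state alongside the disks -->
  <vmstate snapshot='external'>
    <source file='/path/to/vmstate.save'/>
  </vmstate>
  <disks>
    <disk name='vda' snapshot='internal'/>
  </disks>
</domainsnapshot>
```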

>
> [ Okay, some of it is handled later in this document, but I think it's
> still useful to leave this summary in my mail. External VM state is
> something that you don't seem to have covered yet - can't we do this
> already with live migration to a file? ]

Yes, external VM state is already covered with live migration to file, 
and I will not be touching it while implementing this RFC, but future 
extensions may be able to further unify the two concepts.

>> libvirt can be taught to honor persistent=no for qemu by creating a
>> qcow2 wrapper file prior to starting qemu, then tearing down that
>> wrapper after the fact, although I'll probably leave that for later in
>> my patch series.
>
> qemu can already do this with -drive snapshot=on. It must be allowed to
> create a temporary file for this to work, of course. Is that a problem?
> If not, I would just forward the option to qemu.

Where would that file be created?  If the main image is in a directory, 
would the temporary file also live in that directory (shared storage 
visible to another qemu for migration purposes) or in local storage 
(preventing migration)?  If migration is possible, would libvirt need to 
learn the name of the temporary file so as to tell the new qemu on the 
destination the same temporary file name it should open?

What about if the main image is a block device - there, the temporary 
file obviously has to live somewhere else, but how does qemu decide 
where, and should that decision be configurable by the user?  How will 
things interact with SELinux labeling?  What about down the road when we 
add enhancements to enforce that qemu cannot open() files on NFS, but 
must instead receive fds by inheritance?

This certainly sounds like some fertile ground for design decisions on 
how libvirt and qemu should interact; I don't know if -drive snapshot=on 
is reliable enough for use by libvirt, or whether libvirt will end up 
having to manage things itself.

Obviously, my implementation of this RFC will start simple, by rejecting 
persistent=no for qemu, until we've answered some of those other design 
questions; I can get snapshot_blkdev support working before we have to 
tackle this enhancement.

>> The other thing to be aware of is that with internal snapshots, qcow2
>> maintains a distinction between current state and a snapshot - that is,
>> qcow2 is _always_ tracking a delta, and never modifies a named snapshot,
>> even when you use 'qemu-img snapshot -a' to revert to different snapshot
>> names.  But with named files, the original file now becomes a read-only
>> backing file to a new active file; if we revert to the original file,
>> and make any modifications to it, the active file that was using it as
>> backing will be corrupted.  Therefore, the safest thing is to reject any
>> attempt to revert to any snapshot (whether checkpoint or disk snapshot)
>> that has an existing child snapshot consisting of an external disk
>> snapshot.  The metadata for each of these children can be deleted
>> manually, but that requires quite a few API calls (learn how many
>> children exist, get the list of children, and for each child, get its
>> xml to see if that child has the target snapshot as a parent, and if so
>> delete the snapshot).  So as shorthand, virDomainRevertToSnapshot will
>> be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN, which
>> first deletes any children of the snapshot about to be deleted prior to
>> reverting to that particular child.
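
[As an aside, the per-child bookkeeping above can be sketched in a few 
lines of Python.  The snapshot names and xml here are made up; in real 
code the {name: xml} mapping would come from virDomainSnapshotListNames 
plus virDomainSnapshotGetXMLDesc, and the final delete from 
virDomainSnapshotDelete.]

```python
# Sketch of the manual "find and delete the children" procedure.
# Only the xml parsing is shown; the libvirt calls that would feed
# and consume it are named in the text above.
import xml.etree.ElementTree as ET

def parent_of(snapshot_xml):
    """Return the <parent><name> text of a snapshot, or None for a root."""
    node = ET.fromstring(snapshot_xml).find('./parent/name')
    return node.text if node is not None else None

def children_of(target, snapshots):
    """Given {name: xml} for every snapshot of a domain, return the
    names whose parent is the target - the ones that must be deleted
    before the target can safely be reverted to."""
    return sorted(n for n, x in snapshots.items() if parent_of(x) == target)

# Made-up example hierarchy: base has two children.
snapshots = {
    'base':    '<domainsnapshot><name>base</name></domainsnapshot>',
    'child-a': '<domainsnapshot><name>child-a</name>'
               '<parent><name>base</name></parent></domainsnapshot>',
    'child-b': '<domainsnapshot><name>child-b</name>'
               '<parent><name>base</name></parent></domainsnapshot>',
}
print(children_of('base', snapshots))   # ['child-a', 'child-b']
```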
>
> I think the API should make it possible to revert to a given external
> snapshot without deleting all children, but by creating another qcow2
> file that uses the same backing file. Basically this new qcow2 file
> would be the equivalent to the "current state" concept qcow2 uses for
> internal snapshots.

Interesting idea.  But I'm not quite sure how to fit it into existing API.

Remember, existing API is that you have an existing file name, and when 
you call snapshot_blkdev, you are specifying a new file name that 
becomes the live qcow2 file, rendering the previous file name as the 
snapshot.  So:

<disk name='vda' snapshot='external'>
   <source file='/path/to/new'/>
</disk>

in the <domainsnapshot> is naming the new active file name, not the 
snapshot.  If we go with that representation, then reverting to the 
snapshot means that you want to re-create a new qcow2 file for new 
active state, but what do we call it?  We can't call it 
/path/to/original (since we want to reuse that as the backing file to 
both branches in the snapshot hierarchy), and we can't call it 
/path/to/new from the xml naming unless we get rid of the existing copy 
of /path/to/new.  I see two options for future enhancements, but neither 
has to be implemented right away (that is, this RFC is fine limiting the 
reversion to a disk-snapshot to only occur when there are no 
descendants, as long as we can later relax that restriction in the 
future once we figure out how to do branched descendants).

>
> It should be possible to make both look the same to users if we think
> this is a good idea.

1. As a user, I'd much rather have an interface where _I_ decide the 
name of the snapshot, but keep the active file name unchanged.  That is, 
the current semantics of snapshot_blkdev feel a bit backward (it 
requires me to tell you the name of the new active file, and the 
existing name becomes the snapshot), where I would naively expect to 
have a mode where I tell you the name to rename() the existing file 
into, at which point you then recreate the original name as an active 
qcow2 file that has the new snapshot name as its backing file.  But I'm 
not entirely sure how that would play with SELinux permissions.  Also, 
while rename() works for files, it is lousy for the case of the original 
name being a block device which can't be rename()'d, so I think the 
current snapshot_blkdev semantics are correct even if they feel a bit 
backwards.  But it would be nice if we could design future qemu 
enhancements that would allow the creation of a snapshot of arbitrary 
name while keeping the live file name unchanged.

2. It is possible to add a new libvirt API, virDomainSnapshotCreateFrom, 
which would create a new snapshot as a child of an existing snapshot 
passed in as its parent.  This would combine the action of reverting to 
a disk-snapshot with the xml argument necessary for naming a new live 
file, so that you could indeed branch off the disk-snapshot with a 
user-specified or libvirt-generated new active file name, without having 
to delete the existing children that were branched off the old active 
file name, and with the original base file becoming the backing file to 
both branches.  Unfortunately, adding a new API is out of the question 
for backporting purposes.

2a. But thinking about it a bit more, maybe we don't need a new API, but 
just an XML enhancement to the existing virDomainSnapshotCreateXML! 
That is, if I specify:
<domainsnapshot>
   <name>branched</name>
   <parent>
     <name>disk-snapshot</name>
   </parent>
   <disks>...</disks>
</domainsnapshot>

then we can accomplish your goal, without any qemu changes, and without 
any new libvirt API.  That is, right now, <parent> is an output-only 
aspect of snapshot xml, but by allowing it to be an input element 
(probably requiring the use of a new flag, 
VIR_DOMAIN_SNAPSHOT_CREATE_BRANCH), it becomes possible both to revert 
to the state of the old snapshot and to specify the new file name used 
to collect the branched delta data from that point in time.  It also 
means that creation of a branched snapshot would have to learn some of 
the same flags as reverting to a snapshot (can you create the branch as 
well as run a new qemu process?).  I'll play with the ideas, once I get 
the groundwork of this RFC done first.

Thanks for forcing me to think about it!

-- 
Eric Blake   eblake at redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org



