[libvirt] [RFC] Image Fleecing for Libvirt (BZ 955734, 905125)

Wed Jul 24 04:22:23 UTC 2013

On 07/15/2013 03:04 PM, Richard W.M. Jones wrote:
> On Mon, Jul 15, 2013 at 05:57:12PM +0800, Fam Zheng wrote:
>> Hi all,
>>
>> QEMU-KVM BZ 955734, and libvirt BZ 905125 are about feature "Read-only
>> point-in-time throwaway snapshot". The development is ongoing on
>> upstream, which implements the core functionality by QMP command
>> drive-backup. I want to demonstrate the HMP/QMP commands here for image
>> fleecing tasks (again) and make sure this interface looks ready and
>> satisfying from Libvirt point of view.

I'm wondering if we can still get something committed in time for the
1.1.1 freeze.  We are close to the freeze, no patches have been
submitted to libvirt, and the qemu design is still under discussion, so
I'm worried that taking a new interface this late in a libvirt release
cycle would be rushing things, and that we should instead wait until
after 1.1.1 before attempting to add anything.  On the other hand, if we
can agree on a sane design now (or at least before rc2, if we miss rc1),
then we can commit to that design for this libvirt release.  Downstream
distros could then use libvirt 1.1.1 as a starting point for rebases
without worrying about so-name compatibility, as long as they sign up
for the effort of backporting the actual implementation from future
upstream qemu and libvirt releases.

We've done the approach of an early commit to a new API in the past,
even if I'm not necessarily the biggest fan of the approach.  For
example, we chose to add virDomainBlockRebase to libvirt 0.9.10 (commit
9f902a2, when qemu 1.0 was current) as a way to expose more
functionality than what virDomainBlockPull supported, even though we
didn't actually implement new functionality until libvirt 1.0.0 and qemu
1.3 (commit c1eb380).  The libvirt API design was sound enough that I
was able to drive the eventual qemu implementation without any problems,
and the implementation could be backported all the way to 0.9.10 without
a so-name bump.

I do want to emphasize that both image fleecing and point-in-time
snapshots are features that people want.  At the same time, today's
qemu.git does not yet have all the patches in place, and we are past
soft freeze for qemu 1.6, so there may be a bit of a debate on the qemu
list on what aspects of the proposed patches to take, or even a decision
that it is too controversial and will wait until qemu 1.7 before being
in upstream qemu.  Historically, we have been reluctant to add
implementations to upstream libvirt until the corresponding qemu feature
is fully baked upstream, and we leave it to distro backporters to decide
whether the feature is important enough to backport onto whatever
earlier version they base their distro on.  At the same time, distro
backporters have more
flexibility with pulling changes that do not require a so-name bump, and
I'm fairly confident that we need a new libvirt API to drive the
features, so if we want to support a distro using libvirt 1.1.1, then we
need to settle on the libvirt API now even if it remains unimplemented
for another libvirt release.

Also, in the past, I have posted a proposed API for virDomainBlockCopy()
[1], but left it unimplemented in upstream libvirt in case future qemu
came up with more options that would need tweaking.  At this point in
time, now that qemu is talking both about adding point-in-time snapshots
(drive-backup) and image fleecing, I think the time is right to commit
to an API for virDomainBlockCopy().

[1] https://www.redhat.com/archives/libvir-list/2012-April/msg00632.html

>>
>> We get cheap point-in-time snapshot, and export it through built in NBD
>> server, by commands described below:
>>
>>  1. qemu-img create -f qcow2 -o backing_file=RUNNING-VM.img BACKUP.qcow2
>>
>>     (although the backing_file option is not honoured in the next step
>>     because we *override* backing file with an existing
>>     BlockDriverState, giving it here does no harm and also makes sure
>>     the created image is of right size.)

Use of qemu-img while the file is also owned by a running qemu is
dangerous.  We'd need the equivalent of this command to be supported
from within qemu, or else create the destination without naming a
backing file and follow up with something like qemu-img rebase -u to
plug in the metadata of what the eventual backing file name will be, all
without ever opening the backing file externally.  But that's a
low-level implementation detail, and shouldn't affect the design of a
libvirt API.
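
To make the second option concrete, here is a minimal sketch that only
builds the two qemu-img command lines and does not run them.  The helper
name and its parameters are hypothetical, and it assumes the caller
already knows the guest-visible size and the backing format (for
example, from libvirt's own bookkeeping) rather than probing the running
image.

```python
def fleecing_target_commands(target, backing, size_bytes, backing_fmt):
    """Return argv lists that prepare `target` without opening `backing`.

    Hypothetical helper: the size and backing format are supplied by the
    caller so that nothing has to inspect the in-use image.
    """
    # Create an empty qcow2 of the right size, with no backing file named.
    create = ["qemu-img", "create", "-f", "qcow2", target, str(size_bytes)]
    # Unsafe mode (-u) only records the backing file name and format in
    # the qcow2 header; no data is compared or copied.
    rebase = ["qemu-img", "rebase", "-u",
              "-F", backing_fmt, "-b", backing, target]
    return [create, rebase]


# Example values are assumptions: a 10 GiB raw image.
for argv in fleecing_target_commands("BACKUP.qcow2", "RUNNING-VM.img",
                                     10 * 1024**3, "raw"):
    print(" ".join(argv))
```

Only the rebase command mentions the running image, and only as a name
to be written into the header.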

>>
>>  2. (HMP) drive_add backing=ide0-hd0,file=BACKUP.qcow2,id=target0,if=none
>>
>>     (where ide0-hd0 is the running BlockDriverState name for
>>     RUNNING-VM.img)

Whether this is done with HMP, or a QMP command gets added in time, is
also a low-level detail.

>>
>>  3. (QMP) drive-backup device=ide0-hd0 mode=drive sync=none target=target0
>>
>>     (NewImageMode 'drive' means target is looked up as a device id, sync
>>     mode 'none' means don't copy any data except copy-on-write the
>>     point in time snapshot data)
>>
>>  4. (QMP) nbd-server-add device=target0
>>
>> When image fleecing done:
>>
>>  1. (QMP) block-job-complete device=ide0-hd0
>>
>>  2. (HMP) drive_del target0
>>
>>  3. rm BACKUP.qcow2
>>
>> Note: HMP drive_add/drive_del has no counterpart in QMP now but a new
>> command blockdev-add to do similar things is WIP, which can be an
>> alternative in QMP flavor.
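
For reference, the QMP steps quoted above would look roughly like this
on the wire.  This is a sketch that only serializes the requests; the
helper names are hypothetical, the 'drive' mode is the one proposed in
this thread rather than something already in a qemu release, and the HMP
drive_add/drive_del steps are left out because they have no QMP
counterpart yet.

```python
import json


def qmp(command, arguments):
    """Serialize a single QMP request as a JSON string."""
    return json.dumps({"execute": command, "arguments": arguments},
                      sort_keys=True)


def fleecing_start(device, target):
    """QMP requests for steps 3 and 4 of the quoted setup sequence."""
    return [
        # sync=none: copy nothing up front, only copy-on-write old data.
        qmp("drive-backup", {"device": device, "target": target,
                             "sync": "none", "mode": "drive"}),
        # Export the point-in-time view through the built-in NBD server.
        qmp("nbd-server-add", {"device": target}),
    ]


def fleecing_stop(device):
    """QMP request for step 1 of the quoted teardown sequence."""
    return [qmp("block-job-complete", {"device": device})]


for message in fleecing_start("ide0-hd0", "target0") + fleecing_stop("ide0-hd0"):
    print(message)
```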

The earlier design I mentioned for virDomainBlockCopy in 2012 would work
on only one disk at a time; a user could start multiple block jobs,
but would have to coordinate them by hand.  Paolo's reply to this thread
suggested an interface that took a list of block devices, rather than
one, and guarantees that the point in time semantic applies to all the
devices at once.  Unfortunately, the current libvirt block job
semantics are tied to a single disk (virDomainGetBlockJobInfo,
virDomainBlockJobAbort),
so if we want to manage multiple disks at a common point in time, it
sounds more like we'd want to treat this as a generic domain job id
rather than a libvirt block job (virDomainGetJobStats,
virDomainAbortJob).  On the other hand, virDomainAbortJob is hard-wired
to a single background job at a time; but with image fleecing, we
definitely want to support multiple clients fleecing from different
points in time simultaneously, which would imply having a job id.
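
As a toy model of what that would imply (not a proposed libvirt API; the
class, method names, and disk names below are all hypothetical), each
fleecing job gets its own id, may span several disks, and can be aborted
without disturbing another job on the same disk:

```python
import itertools


class FleecingJobs:
    """Toy registry: each job has its own id and covers a set of disks."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._jobs = {}  # job id -> frozenset of disk names

    def start(self, disks):
        """Register one point-in-time job spanning all the given disks."""
        job_id = next(self._next_id)
        self._jobs[job_id] = frozenset(disks)
        return job_id

    def abort(self, job_id):
        """Drop one job by id and return the disks it covered."""
        return self._jobs.pop(job_id)

    def jobs_on(self, disk):
        """Ids of all jobs currently covering the given disk."""
        return sorted(j for j, disks in self._jobs.items() if disk in disks)


jobs = FleecingJobs()
first = jobs.start(["vda", "vdb"])   # one point in time across two disks
second = jobs.start(["vda"])         # a later point in time, same disk
jobs.abort(first)                    # the second client is unaffected
print(jobs.jobs_on("vda"))
```

Neither a per-disk key (as with block jobs) nor a single per-domain job
slot (as with virDomainAbortJob) can express both properties at once.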

Therefore, I'm worried that properly supporting this will involve the
addition of multiple APIs; adding just a super-powered
virDomainBlockCopy() does not give us as much control as I think we
want.

It's late for me, and I know DV wants to cut rc1, but I hope this sparks
some conversation, and that we can decide whether to pursue an API for
image fleecing as part of libvirt 1.1.1, or whether we punt and state
that there is just too much design work still in a state of flux.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
