[libvirt] [RFC] Proposed API to support block device streaming

Daniel P. Berrange berrange at redhat.com
Mon Nov 15 13:05:21 UTC 2010


On Wed, Nov 10, 2010 at 08:45:20AM -0600, Adam Litke wrote:
> On Wed, 2010-11-10 at 11:33 +0000, Daniel P. Berrange wrote:
> > On Tue, Nov 09, 2010 at 03:17:23PM -0600, Adam Litke wrote:
> > > I've been working with Anthony Liguori and Stefan Hajnoczi to enable data
> > > streaming to copy-on-read disk images in qemu.  This work is working its way
> > > through peer review and I expect it to be upstream soon as part of the support
> > > for the new QED disk image format.
> > > 
> > > I would like to enable these commands in libvirt in order to support at least
> > > two compelling use cases:
> > > 
> > > 1) Rapid deployment of domains:
> > > Creating a new domain from a central repository of images can be time consuming
> > > since a local copy of the image must be made before the domain can be started.
> > > With copy-on-read and streaming, up-front copy time is eliminated and the
> > > domain can be started immediately.  Streaming can run while the domain runs
> > > to fully populate the disk image.
> > > 
> > > 2) Post-copy live block migration:
> > > A qemu-nbd server is started on the source host and serves the domain's block
> > > device to the destination host.  A QED image is created on the destination host
> > > with backing to the nbd server.  The domain is migrated as normal.  When
> > > migration completes, a stream command is executed to fully populate the
> > > destination QED image.  After streaming completes, the qemu-nbd server can
> > > be shut down and the domain (including local storage) is fully independent of
> > > the source host.
> > > 
> > > Qemu will support two streaming modes: full device and single sector.  Full
> > > device streaming is the easiest to use because one command will cause the whole
> > > device to be streamed as fast as possible.  Single sector mode can be used if
> > > one wants to throttle streaming to reduce I/O pressure.  In this mode, the user
> > > issues individual commands to stream single sectors.
> > > 
> > > To enable this support in libvirt, I propose the following API...
> > > 
> > > virDomainStreamDisk() initiates either a full device stream or a single sector
> > > stream (depending on virDomainStreamDiskFlags).  For a full device stream, it
> > > returns either 0 or -1.  For a single sector stream, it returns an offset that
> > > can be used to continue streaming with a subsequent call to virDomainStreamDisk().
> > > 
> > > virDomainStreamDiskInfo() returns the status of a currently-running full device
> > > stream (the device name, current streaming position, and total size).
> > > 
> > > Comments on this design would be greatly appreciated.  Thanks!
> > 
> > I'm finding it hard to say whether these APIs are suitable or not
> > because I can't see what this actually maps to in terms of
> > implementation. 
> 
> Please see the qemu driver piece that I will post as a reply to this
> email.  Since I am not looking for any particular code review at this
> point I decided not to post the whole series.  But I would be happy to
> do so.

I'm not too worried about the code; I just wanted to understand what
logical set of QEMU operations it maps to.
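
For reference, here is roughly what I understand the proposal to be.
The flag and struct shapes below are my own sketch based on your
description, not a review of your patch:

  /* Modes as I read them (all names tentative); CANCEL is the
   * extra flag you suggest below for aborting an active stream. */
  typedef enum {
      VIR_STREAM_DISK_FULL   = (1 << 0), /* stream whole device */
      VIR_STREAM_DISK_ONE    = (1 << 1), /* stream a single sector */
      VIR_STREAM_DISK_CANCEL = (1 << 2), /* abort an active stream */
  } virDomainStreamDiskFlags;

  /* Full device mode: returns 0 on success, -1 on error.
   * Single sector mode: returns the offset to pass to the next
   * call, or -1 on error. */
  long long virDomainStreamDisk(virDomainPtr dom,
                                const char *path,
                                unsigned long long offset,
                                unsigned int flags);

  /* Hypothetical status record for an active full device stream. */
  typedef struct _virDomainStreamDiskState {
      char name[64];             /* disk device alias */
      unsigned long long offset; /* current streaming position */
      unsigned long long size;   /* total size of the device */
  } virDomainStreamDiskState;

  int virDomainStreamDiskInfo(virDomainPtr dom,
                              virDomainStreamDiskState *state,
                              unsigned int flags);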

> > Do these calls need to be run before the QEMU process is started,
> > or after QEMU is already running?
> 
> Streaming requires a running domain and runs concurrently.

What if you have a disk image and want to activate streaming
without running a VM? e.g. so you can ensure the image is
fully downloaded to the host, and thus avoid a runtime problem
that would result in an I/O error for the guest.

> > Does the path in the arg actually need to exist on disk before 
> > streaming begins, or do these APIs create the image too ?
> 
> The path actually refers to the alias of the currently attached disk
> (which must be a copy-on-read disk).  For example: 'drive-virtio-disk0'.
> When started, the stream command will populate the local image file with
> blocks from the backing file until the local file is complete and the
> backing_file link can be broken.

NB, libvirt intentionally doesn't expose the device backend
aliases in the API. So this should refer to the device
alias which is included in the XML.
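
i.e. something like this fragment of the domain XML (file names
purely illustrative):

  <disk type='file' device='disk'>
    <driver name='qemu' type='qed'/>
    <source file='/var/lib/libvirt/images/guest.qed'/>
    <target dev='vda' bus='virtio'/>
    <alias name='virtio-disk0'/>
  </disk>

So the API should take 'virtio-disk0' (or the target dev name),
rather than the QEMU-internal backend name 'drive-virtio-disk0'.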

> > If we're streaming the whole disk, is there a way to cancel/abort
> > it early?
> 
> I was thinking of adding another mode flag for this:
> VIR_STREAM_DISK_CANCEL
> 
> > What happens if qemu-nbd dies before streaming is complete?
> 
> Bad things.  Same as if you deleted a qcow2 backing file.

So a migration lifecycle based on this design has a pretty
dangerous failure mode. The guest can lose access to the
NBD server before the disk copy is complete, and we'd be
unable to switch back to the original QEMU instance, since
the target has already started dirtying memory, which
invalidates the source.

> 
> > Who/what starts the qemu-nbd process ?
> 
> This API doesn't yet implement any kind of migration workflow (but that
> is next on my plate).  As currently designed, an external entity would
> prepare an NBD server on the source machine and create the target block
> device on the destination host (linked to the nbd server).  Once these
> two things are set up, the normal libvirt migration workflow can be
> used.  On the destination machine, the stream command would then be used
> to expediently remove the domain's dependency on the nbd-served base
> image.
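
To make that division of labour concrete, I presume the external
preparation amounts to something like this (ports, paths and
hostnames are arbitrary examples):

  # On the source host, export the running domain's image
  qemu-nbd --port=10809 /var/lib/libvirt/images/guest.img

  # On the destination host, create a QED image backed by the export
  qemu-img create -f qed -b nbd:srchost:10809 \
      /var/lib/libvirt/images/guest.qed

  # Then the normal libvirt migration workflow, e.g.
  virsh migrate --live guest qemu+ssh://dsthost/system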
> 
> > If you have a guest on host A and want to migrate to host B, we presumably
> > need to start qemu-nbd on host A, while the guest is still running on
> > host A, i.e. we end up with two processes having the same disk image open on
> > host A for a while.
> 
> Yes.
> 
> > How we'd wire qemu-nbd up into the security driver framework is of
> > particular concern here, because I'd think we'd want qemu-nbd to run
> > with the same privileges as the QEMU process, so that it's isolated
> > from all other QEMU processes on the host and can only access the one
> > set of disks for that VM.
> 
> This would be for the block-migration workflow...  I can't see any
> particular problem with running qemu-nbd as a regular user.  That's how
> I do it when testing.

These last few points are my biggest concern with the API. If we
iteratively add a bunch of APIs for each piece of functionality
involved here, then we'll end up with a migration lifecycle that
requires the app to know about invoking tens of different API
calls in a perfect sequence. This seems like a very complex and
fragile design for apps to have to deal with.

Direct QEMU<->QEMU migration is already sub-optimal in that it
requires opening many ports in the firewall (assuming you want
to allow multiple concurrent VMs to migrate). We can address
that limitation by having libvirt take ownership of the port on
the destination host, and then pass the incoming client socket
on to QEMU, or manually forward traffic. Adding in multiple NBD
network sockets makes the firewall management problem even 
worse. 

If we want to be able to use this functionality without requiring
apps to have a direct shell into the host, then we need a set of
APIs for managing NBD server instances for migration, which is
another level of complexity. 

A simpler architecture would be to have the NBD server embedded
inside the source QEMU VM, and tunnel the NBD protocol over the
existing migration socket. So QEMU would do a normal migration
of RAM; when that completes, the source QEMU's CPUs are stopped,
but the process is left running to continue serving the disk data.
This avoids any extra network connections, and avoids having to
add any new APIs to manage NBD servers, and avoids all the 
security driver & lock manager integration problems that the latter
will involve.  If it is critical to free up RAM on the source
host, then the main VM RAM area can be munmap()'d on the source
once main migration completes, since it's not required for the
ongoing NBD data stream.  This kind of architecture means that
apps would need near zero knowledge of disk streaming to make
use of it. The existing virDomainMigrate() would be sufficient,
with an extra flag to request post-migration streaming. There
would still be a probable need for your suggested API to force
immediate streaming of a disk, instead of relying on NBD, but
most apps wouldn't have to care about that if they didn't want
to.
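
From the application's point of view, the whole lifecycle could
then collapse to a single call. The extra flag here is hypothetical,
purely to illustrate:

  /* VIR_MIGRATE_STREAM_DISKS does not exist today; it stands in
   * for a flag requesting post-migration disk streaming. */
  virDomainPtr ddom;
  ddom = virDomainMigrate(dom, dconn,
                          VIR_MIGRATE_LIVE | VIR_MIGRATE_STREAM_DISKS,
                          NULL,  /* keep the same domain name */
                          NULL,  /* default migration URI */
                          0);    /* no bandwidth limit */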

In summary, though, I'm not inclined to proceed with adding ad-hoc
APIs for disk streaming to libvirt without fully considering
the design of a full migration+disk streaming architecture.

Regards,
Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|



