[libvirt] [RFC] Proposed API to support block device streaming

Anthony Liguori aliguori at linux.vnet.ibm.com
Mon Nov 15 14:24:44 UTC 2010


On 11/15/2010 07:05 AM, Daniel P. Berrange wrote:
>>> Do these calls need to be run before the QEMU process is started,
>>> or after QEMU is already running ?
>>>        
>> Streaming requires a running domain and runs concurrently.
>>      
> What if you have a disk image and want to activate streaming
> without running a VM ? eg, so you can ensure the image is
> fully downloaded to the host and thus avoid a runtime problem
> which would result in IO error for the guest
>    

I hadn't considered offline streaming as a use-case.  Is this more of a 
theoretical consideration or something you would like to see as part 
of the libvirt API?

I'm struggling to understand the usefulness of it.  If you care 
about streaming offline, you can just do a normal image copy.  It seems 
like this really would only apply to a use-case where you started out 
wanting online streaming, could not complete the streaming, and then 
instead of resuming online streaming, wanted to do offline streaming.

It doesn't seem that practical to me.

>>> If we're streaming the whole disk, is there a way to cancel/abort
>>> it early ?
>>>        
>> I was thinking of adding another mode flag for this:
>> VIR_STREAM_DISK_CANCEL
>>
>>      
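For illustration, I'd expect cancellation to reuse the same entry point 
with that flag.  A minimal sketch follows; virDomainStreamDisk() here 
stands in for the call shape proposed in this RFC, and its prototype is 
still very much under discussion, so treat the names as illustrative:

/* Sketch only: virDomainStreamDisk() is a placeholder for the entry
 * point proposed in this RFC; the final prototype may differ.
 * VIR_STREAM_DISK_CANCEL is the mode flag suggested above. */
#include <stdio.h>
#include <libvirt/libvirt.h>

static int
cancel_stream(virDomainPtr dom, const char *disk_path)
{
    /* The same call that starts a stream takes a CANCEL mode flag
     * to abort an in-progress stream early. */
    if (virDomainStreamDisk(dom, disk_path, VIR_STREAM_DISK_CANCEL) < 0) {
        fprintf(stderr, "failed to cancel streaming of %s\n", disk_path);
        return -1;
    }
    return 0;
}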
>>> What happens if qemu-nbd dies before streaming is complete ?
>>>        
>> Bad things.  Same as if you deleted a qcow2 backing file.
>>      
> So a migration lifecycle based on this design has a pretty
> dangerous failure mode. The guest can lose access to the
> NBD server before the disk copy is complete, and we'd be
> unable to switch back to the original QEMU instance since
> the target has already started dirtying memory which has
> invalidated the source.
>    

Separate out the live migration use-case from the streaming use-case.  
This patch series is just about image streaming.  Here's the expected 
use-case:

I'm a cloud provider and I want to deploy new guests rapidly based on 
template images.  I want the deployed image to reside on local storage 
for the deployed node to avoid excessive network traffic (with high node 
density, the network becomes the bottleneck).

My options today are:

1) Copy the image to the new node.  This incurs a huge upfront cost in 
time.  In a cloud environment, rapid provisioning is very important, so 
this is a major issue.

2) Use shared storage for the template images and then create a 
copy-on-write image on local storage.  This enables rapid provisioning 
but still uses the network for data reads.  This also requires that the 
template images stay around forever or that you have complicated 
management support for tracking which template images are still in use.

With image streaming, you get rapid provisioning as in (2), but you also 
get to satisfy reads from local storage, eliminating pressure on the 
network.  Since streaming gives you a deterministic period during which 
the copy-on-write image depends on the template image, it also simplifies 
template image tracking.
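As a rough sketch of that flow (virDomainCreateXML() is existing libvirt 
API; the streaming call and its START flag are placeholders for what this 
series proposes, so those names are illustrative):

/* Illustrative only: assumes the local copy-on-write image was already
 * created against the shared template (e.g. with qemu-img create -b),
 * and that the proposed streaming API copies the backing data into the
 * local image while the guest runs.  virDomainStreamDisk() and
 * VIR_STREAM_DISK_START are stand-ins for names under discussion. */
#include <libvirt/libvirt.h>

static virDomainPtr
provision_and_stream(virConnectPtr conn, const char *domain_xml,
                     const char *local_cow_path)
{
    virDomainPtr dom = virDomainCreateXML(conn, domain_xml, 0);
    if (!dom)
        return NULL;

    /* The guest starts immediately; reads of blocks that have not yet
     * been streamed are satisfied from the template image. */
    if (virDomainStreamDisk(dom, local_cow_path,
                            VIR_STREAM_DISK_START) < 0) {
        virDomainFree(dom);
        return NULL;
    }

    /* Once streaming completes, the local image no longer depends on
     * the template, and the template can be retired. */
    return dom;
}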

In terms of points of failure, image streaming is a bit better than (2): 
there are still two points of failure, but only for a deterministic 
period of time rather than for the life of the guest.

>> This would be for the block-migration workflow...  I can't see any
>> particular problem with running qemu-nbd as a regular user.  That's how
>> I do it when testing.
>>      
> These last few points are my biggest concern with the API. If we
> iteratively add a bunch of APIs for each piece of functionality
> involved here, then we'll end up with a migration lifecycle that
> requires the app to know about invoking tens of different API
> calls in a perfect sequence. This seems like a very complex and
> fragile design for apps to have to deal with.
>    

Migration is a totally different API.  This particular API is focused 
entirely on streaming.  Using it to enable live migration should not be 
recommended (even though it's technically possible).

For live migration, I think we really have to look more carefully at the 
libvirt API.  To support post-copy migration in a robust fashion, we 
need to figure out how we want to tunnel the traffic, provide an 
interface to select which devices to migrate, etc.

> If we want to be able to use this functionality without requiring
> apps to have a direct shell into the host, then we need a set of
> APIs for managing NBD server instances for migration, which is
> another level of complexity.
>
> A simpler architecture would be to have the NBD server embedded
> inside the source QEMU VM, and tunnel the NBD protocol over the
> existing migration socket. So QEMU would do a normal migration
> of RAM; when that completes, the source QEMU CPUs are stopped,
> but QEMU is left running to continue serving the disk data.
> This avoids any extra network connections, and avoids having to
> add any new APIs to manage NBD servers, and avoids all the
> security driver & lock manager integration problems that the latter
> will involve.  If it is critical to free up RAM on the source
> host, then the main VM ram area can be munmap()d on the source
> once main migration completes, since it's not required for the
> ongoing NBD data stream.  This kind of architecture means that
> apps would need near zero knowledge of disk streaming to make
> use of it. The existing virDomainMigrate() would be sufficient,
> with an extra flag to request post-migration streaming. There
> would still be a probable need for your suggested API to force
> immediate streaming of a disk, instead of relying on NBD, but
> most apps wouldn't have to care about that if they didn't want
> to.
>
> In summary though, I'm not inclined to proceed with adding ad-hoc
> APIs for disk streaming to libvirt, without fully considering
> the design of a full migration+disk streaming architecture.
>    

Migration is an orthogonal discussion.

In the streaming model, the typical way to support a base image is not 
NBD but NFS.
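To make that concrete, the layout I have in mind looks something like 
the following (the paths are illustrative):

/* Illustrative layout, with the template exported over NFS:
 *
 *   /mnt/templates/fedora.img          read-only template (NFS mount)
 *   /var/lib/libvirt/images/vm1.qcow2  local COW image whose backing
 *                                      file is the NFS template
 *
 * Until streaming finishes, reads of uncopied blocks go over NFS;
 * afterwards vm1.qcow2 is self-contained.  A simple reachability check
 * before starting a guest avoids the "deleted backing file" failure
 * mode mentioned earlier in this thread: */
#include <stdio.h>
#include <unistd.h>

static int
template_reachable(const char *nfs_template_path)
{
    if (access(nfs_template_path, R_OK) < 0) {
        fprintf(stderr, "template %s unreachable; guest reads would "
                "see I/O errors\n", nfs_template_path);
        return 0;
    }
    return 1;
}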

Streaming is a very different type of functionality from migration, and 
trying to lump the two together would create an awful lot of user 
confusion, IMHO.

Regards,

Anthony Liguori

> Regards,
> Daniel
>    
