[libvirt] [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration

Tue Sep 7 15:09:43 UTC 2010

On Tue, Sep 7, 2010 at 4:00 PM, Anthony Liguori
<aliguori at linux.vnet.ibm.com> wrote:
> On 09/07/2010 09:55 AM, Stefan Hajnoczi wrote:
>>
>> On Tue, Sep 7, 2010 at 3:51 PM, Anthony Liguori
>> <aliguori at linux.vnet.ibm.com> wrote:
>>
>>>
>>> On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
>>>
>>>>
>>>> On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori
>>>> <aliguori at linux.vnet.ibm.com> wrote:
>>>>
>>>>
>>>>>
>>>>> The interface for copy-on-read is just an option within qemu-img
>>>>> create.  Streaming, on the other hand, requires a bit more thought.
>>>>> Today, I have a monitor command that does the following:
>>>>>
>>>>> stream <device> <sector offset>
>>>>>
>>>>> Which will try to stream the minimal amount of data for a single I/O
>>>>> operation and then return how many sectors were successfully streamed.
>>>>>
>>>>> The idea about how to drive this interface is a loop like:
>>>>>
>>>>> offset = 0;
>>>>> while offset < image_size:
>>>>>   wait_for_idle_time()
>>>>>   count = stream(device, offset)
>>>>>   offset += count
>>>>>
>>>>> Obviously, the "wait_for_idle_time()" requires wide system awareness.
>>>>> The thing I'm not sure about is whether libvirt would want to 1) expose
>>>>> a similar stream interface and let management software determine idle
>>>>> time, or 2) attempt to detect idle time on its own and provide a higher
>>>>> level interface.  If (2), the question then becomes whether we should
>>>>> try to do this within qemu and provide libvirt a higher level
>>>>> interface.
>>>>>
>>>>>
>>>>
>>>> A self-tuning solution is attractive because it reduces the need for
>>>> other components (management stack) or the user to get involved.  In
>>>> this case self-tuning should be possible.  We need to detect periods
>>>> of I/O inactivity, for example tracking the number of in-flight
>>>> requests and then setting a grace timer when it reaches zero.  When
>>>> the grace timer expires, we start streaming until the guest initiates
>>>> I/O again.
>>>>
>>>>
>>>
>>> That detects idle I/O within a single QEMU guest, but you might have
>>> another guest running that's I/O bound, which means that from an
>>> overall system throughput perspective, you really don't want to stream.
>>>
>>> I think libvirt might be able to do a better job here by looking at
>>> overall system I/O usage.  But I'm not sure, hence this RFC :-)
>>>
>>
>> Isn't this what the block I/O controller cgroup is meant to solve?  If
>> you give vm-1 50% block bandwidth and vm-2 50% block bandwidth then
>> vm-1 can do streaming without eating into vm-2's guaranteed bandwidth.
>>
>
> That assumes you're capping I/O.  But sometimes you care about overall
> system throughput more than you care about any individual VM.
>
> Another way to look at it may be, a user waits for a cron job that runs at
> midnight and starts streaming as necessary.  However, the user wants to be
> able to interrupt the streaming should there be a sudden demand.
>
> If the user drives the streaming through an interface like I've specified,
> they're in full control.  It's pretty simple to build interfaces on top of
> this that implement stream as an aggressive or conservative background task
> too.
>
>>  Also, I'm not sure we should worry about the priority of the I/O too
>> much: perhaps the user wants their vm to stream more than they want an
>> unimportant local vm that is currently I/O bound to have all resources
>> to itself.  So I think it makes sense to defer this and not try for
>> system-wide knowledge inside a QEMU process.
>>
>
> Right, so that argues for an incremental interface like I started with :-)
>
> BTW, this whole discussion is also relevant for other background tasks like
> online defragmentation so keep that use-case in mind too.

Right, I'm a little hesitant to get too far into discussing the
management interface because I remember long threads about polling and
async.  I never fully read them but I bet some wisdom came out of them
that applies here.

There are two ways to do a long running (async?) task:
1. Multiple smaller pokes.  Perhaps completion of a single poke is
async.  But the key is that the interface is incremental and driven by
the management stack.
2. State.  Turn on streaming and watch it go.  You can find out its
current state using another command which will tell you whether it is
enabled/disabled and progress.  Use a command to disable it.
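
To make the contrast concrete, here is a rough Python sketch of both
styles against a toy in-memory model.  Everything in it is made up for
illustration: FakeStreamer, stream_start, stream_stop, stream_status and
tick are hypothetical names, not existing QEMU or libvirt commands.  Only
stream(offset) loosely mirrors the monitor command quoted above, and the
fixed chunk size is an assumption.

```python
class FakeStreamer:
    """Toy stand-in for an image being streamed; not a QEMU API."""

    def __init__(self, image_size, chunk=8):
        self.image_size = image_size  # total sectors in the image
        self.chunk = chunk            # assumed max sectors copied per poke
        self.offset = 0               # progress kept by the stateful style
        self.enabled = False

    # Style 1: incremental poke.  The caller owns the loop and the offset.
    def stream(self, offset):
        """Pretend to copy up to one chunk; return sectors streamed."""
        return max(0, min(self.chunk, self.image_size - offset))

    # Style 2: state.  The callee owns the loop; the caller toggles and polls.
    def stream_start(self):
        self.enabled = True

    def stream_stop(self):
        self.enabled = False

    def stream_status(self):
        return {"enabled": self.enabled,
                "offset": self.offset,
                "end": self.image_size}

    def tick(self):
        """One unit of background work, run by the callee when enabled."""
        if self.enabled:
            self.offset += self.stream(self.offset)
            if self.offset >= self.image_size:
                self.enabled = False  # streaming finished


def drive_incrementally(streamer):
    """Management stack drives every poke; returns the number of pokes."""
    offset = 0
    pokes = 0
    while offset < streamer.image_size:
        # A real caller would wait for idle time here before each poke.
        offset += streamer.stream(offset)
        pokes += 1
    return pokes


def drive_stateful(streamer, max_ticks):
    """Turn streaming on, let it run, and poll its state until done."""
    streamer.stream_start()
    for _ in range(max_ticks):
        streamer.tick()
        if not streamer.stream_status()["enabled"]:
            break
    return streamer.stream_status()
```

In the first style the policy (when to poke, when to back off) lives
entirely in the caller.  In the second the caller can only start, stop
and query, so any idle detection has to live next to tick().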

Stefan