[libvirt] [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration

Sun Sep 12 13:40:23 UTC 2010

  On 09/12/2010 03:25 PM, Anthony Liguori wrote:
> On 09/12/2010 07:41 AM, Avi Kivity wrote:
>>  On 09/07/2010 05:57 PM, Anthony Liguori wrote:
>>>> I agree that streaming should be generic, like block migration.  The
>>>> trivial generic implementation is:
>>>>
>>>> void bdrv_stream(BlockDriverState* bs)
>>>> {
>>>>      for (sector = 0; sector<  bdrv_getlength(bs); sector += n) {
>>>>          if (!bdrv_is_allocated(bs, sector,&n)) {
>>>
>>> Three problems here.  First problem is that bdrv_is_allocated is 
>>> synchronous. 
>>
>> Put the whole thing in a thread.
>
> It doesn't fix anything.  You don't want stream to serialize all I/O 
> operations.

Why would it serialize all I/O operations?  It's just like another vcpu 
issuing reads.

>
>>> The second problem is that streaming makes the most sense when it's 
>>> the smallest useful piece of work whereas bdrv_is_allocated() may 
>>> return a very large range.
>>>
>>> You could cap it here but you then need to make sure that cap is at 
>>> least cluster_size to avoid a lot of unnecessary I/O.
>>
>> That seems like a nice solution.  You probably want a multiple of the 
>> cluster size to retain efficiency.
>
> What you basically do is:
>
> stream_step_three():
>    complete()
>
> stream_step_two(offset, length):
>    bdrv_aio_readv(offset, length, buffer, stream_step_three)
>
> bdrv_aio_stream():
>     bdrv_aio_find_free_cluster(stream_step_two)

Isn't there a write() missing somewhere?

>
> And that's exactly what the current code looks like.  The only change 
> to the patch that this does is make some of qed's internals be block 
> layer interfaces.

Why do you need find_free_cluster()?  That's a physical offset thing.  
Just write to the same logical offset.

IOW:

     bdrv_aio_stream():
         bdrv_aio_read(offset, stream_2)

     stream_2():
         if all zeros:
             increment offset
             if more:
                 bdrv_aio_stream()
        bdrv_aio_write(offset, stream_3)

     stream_3():
         bdrv_aio_write(offset, stream_4)

     stream_4():
         increment offset
         if more:
              bdrv_aio_stream()

Of course, need to serialize wrt guest writes, which adds a bit more 
complexity.  I'll leave it to you to code the state machine for that.

>
> One of the things Stefan has mentioned is that a lot of the QED code 
> could be reused by other formats.  All formats implement things like 
> CoW on their own today but if you exposed interfaces like 
> bdrv_aio_find_free_cluster(), you could actually implement a lot more 
> in the generic block layer.
>
> So, I agree with you in principle that this all should be common 
> code.  I think it's a larger effort though.

Not that large I think; and it will make commit async as a side effect.

>>>
>>> The QED streaming implementation is 140 LOCs too so you quickly end 
>>> up adding more code to the block formats to support these new 
>>> interfaces than it takes to just implement it in the block format.
>>
>> bdrv_is_allocated() already exists (and is needed for commit), what 
>> else is needed?  cluster size?
>
> Synchronous implementations are not reusable to implement asynchronous 
> anything. 

Surely this is easy to fix, at least for qed.

What we need is thread infrastructure that allows us to convert between 
the two methods.

> But you need the code to be cluster aware too.

Yes, another variable in BlockDriverState.

>
>>> Third problem is that  streaming really requires being able to do 
>>> zero write detection in a meaningful way.  You don't want to always 
>>> do zero write detection so you need another interface to mark a 
>>> specific write as a write that should be checked for zeros.
>>
>> You can do that in bdrv_stream(), above, before the actual write, and 
>> call bdrv_unmap() if you detect zeros.
>
> My QED branch now does that FWIW.  At the moment, it only detects zero 
> reads to unallocated clusters and writes a special zero cluster 
> marker.  However, the detection code is in the generic path so once 
> the fsck() logic is working, we can implement a free list in QED.
>
> In QED, the detection code needs to have a lot of knowledge about 
> cluster boundaries and the format of the device.  In principle, this 
> should be common code but it's not for the same reason copy-on-write 
> is not common code today.

Parts of it are: commit.  Of course, that's horribly synchronous.

-- 
error compiling committee.c: too many arguments to function