[libvirt] [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
Avi Kivity
avi at redhat.com
Sun Sep 12 13:40:23 UTC 2010
On 09/12/2010 03:25 PM, Anthony Liguori wrote:
> On 09/12/2010 07:41 AM, Avi Kivity wrote:
>> On 09/07/2010 05:57 PM, Anthony Liguori wrote:
>>>> I agree that streaming should be generic, like block migration. The
>>>> trivial generic implementation is:
>>>>
>>>> void bdrv_stream(BlockDriverState* bs)
>>>> {
>>>> for (sector = 0; sector< bdrv_getlength(bs); sector += n) {
>>>> if (!bdrv_is_allocated(bs, sector,&n)) {
>>>
>>> Three problems here. First problem is that bdrv_is_allocated is
>>> synchronous.
>>
>> Put the whole thing in a thread.
>
> It doesn't fix anything. You don't want stream to serialize all I/O
> operations.
Why would it serialize all I/O operations? It's just like another vcpu
issuing reads.
>
>>> The second problem is that streaming makes the most sense when it's
>>> the smallest useful piece of work whereas bdrv_is_allocated() may
>>> return a very large range.
>>>
>>> You could cap it here but you then need to make sure that cap is at
>>> least cluster_size to avoid a lot of unnecessary I/O.
>>
>> That seems like a nice solution. You probably want a multiple of the
>> cluster size to retain efficiency.
>
> What you basically do is:
>
> stream_step_three():
> complete()
>
> stream_step_two(offset, length):
> bdrv_aio_readv(offset, length, buffer, stream_step_three)
>
> bdrv_aio_stream():
> bdrv_aio_find_free_cluster(stream_step_two)
Isn't there a write() missing somewhere?
>
> And that's exactly what the current code looks like. The only change
> to the patch that this does is make some of qed's internals be block
> layer interfaces.
Why do you need find_free_cluster()? That's a physical offset thing.
Just write to the same logical offset.
IOW:
bdrv_aio_stream():
bdrv_aio_read(offset, stream_2)
stream_2():
if all zeros:
increment offset
if more:
bdrv_aio_stream()
bdrv_aio_write(offset, stream_3)
stream_3():
bdrv_aio_write(offset, stream_4)
stream_4():
increment offset
if more:
bdrv_aio_stream()
Of course, need to serialize wrt guest writes, which adds a bit more
complexity. I'll leave it to you to code the state machine for that.
>
> One of the things Stefan has mentioned is that a lot of the QED code
> could be reused by other formats. All formats implement things like
> CoW on their own today but if you exposed interfaces like
> bdrv_aio_find_free_cluster(), you could actually implement a lot more
> in the generic block layer.
>
> So, I agree with you in principle that this all should be common
> code. I think it's a larger effort though.
Not that large I think; and it will make commit async as a side effect.
>>>
>>> The QED streaming implementation is 140 LOCs too so you quickly end
>>> up adding more code to the block formats to support these new
>>> interfaces than it takes to just implement it in the block format.
>>
>> bdrv_is_allocated() already exists (and is needed for commit), what
>> else is needed? cluster size?
>
> Synchronous implementations are not reusable to implement asynchronous
> anything.
Surely this is easy to fix, at least for qed.
What we need is thread infrastructure that allows us to convert between
the two methods.
> But you need the code to be cluster aware too.
Yes, another variable in BlockDriverState.
>
>>> Third problem is that streaming really requires being able to do
>>> zero write detection in a meaningful way. You don't want to always
>>> do zero write detection so you need another interface to mark a
>>> specific write as a write that should be checked for zeros.
>>
>> You can do that in bdrv_stream(), above, before the actual write, and
>> call bdrv_unmap() if you detect zeros.
>
> My QED branch now does that FWIW. At the moment, it only detects zero
> reads to unallocated clusters and writes a special zero cluster
> marker. However, the detection code is in the generic path so once
> the fsck() logic is working, we can implement a free list in QED.
>
> In QED, the detection code needs to have a lot of knowledge about
> cluster boundaries and the format of the device. In principle, this
> should be common code but it's not for the same reason copy-on-write
> is not common code today.
Parts of it are: commit. Of course, that's horribly synchronous.
--
error compiling committee.c: too many arguments to function
More information about the libvir-list
mailing list