[libvirt] Block replication driver

Wed Mar 23 20:36:10 UTC 2016

On 03/15, Peter Krempa wrote:
> Sorry for my late response ...
> 
> On Mon, Feb 22, 2016 at 22:51:55 +0100, Simon wrote:
> > On 02/16, Peter Krempa wrote:
> > > On Fri, Feb 12, 2016 at 13:22:01 -0700, Eric Blake wrote:
> > > > On 12/02/2015 02:08 AM, Simon Kollberg wrote:

[...]

I think most of what has been discussed so far is related to the
solution further down in one way or another. So I cut down the thread to
make it a bit lighter.

> > 
> > I get why using <disk> for something that is not going to be visible to
> > the guest is a bad idea. On the other hand, in this case the disk that
> > is not going to be attached to the guest is actually the normal disk
> > (labeled "secondary disk" on the block replication wiki). I don't think
> > it makes sense to make that into something that is not <disk>.
> 
> It does. The <devices> portion contains stuff that is guest visible. The
> fact that qemu calls it a disk is a qemu implementation detail. For our
> use it's not guest visible and we are not copying the design of qemu in
> cases where it does not make sense.
> 

Just to make sure that we're on the same page. In the secondary VM, when
block replication is enabled, the disk that is actually written to by
the guest is the block replication 'active disk' and not the main disk.
That is why I'm hesitent to completely cut the block replication disks
out from the disk XML and make it into a entierly different entity.

I do see your point though, since the block replication disks are
supposed to be transparent from the guest's point of view.

> > I think many issues that has been brought up would be solved by a new
> > disk definition where the replication and the normal disk is combined.
> > But again, I don't think that the quorum and the block replication
> > should be defined the same way since block replication acts more as a
> > chain while the quorum as an array.
> > 
> > What do you guys think about this?
> > 
> > <disk type='file' device='disk'>
> >     <driver name='qemu' type='raw'/>
> >     <source file='/opt/libvirt/1.raw'/>
> >     <target dev='vda' bus='virtio'/>
> > 
> >     <replication type='file' device='disk' mode='secondary'>
> >         <driver name='qemu' type='qcow2'/>
> >         <source file='/opt/libvirt/active_disk.qcow2'/>
> >         <backingStore type='file'>
> >             <format type='qcow2'/>
> >             <source file='/opt/libvirt/hidden_disk.qcow2'/>
> >         </backingStore>
> >     </replication>
> 
> This actually looks more reasonable although I have some comments:
> 
> - device='disk' doesn't make any sense here, this is not guest visible

Same response as above. I could agree that it might be redundant to keep
device='disk' in both the <disk> and the <replication> element if the
<replication> element is nested under the <disk>. But the fact is, it's
the replication disk that is being attached as a device, not the normal
disk (this only get writes from the primary through nbd).
See: http://wiki.qemu.org/Features/BlockReplication

> - the <driver> element corresponds to the format of the image. I think
>   that we should actually call it <format> to avoid confusion.

Yea this is a bit confusing, same goes for normal disks. At some places
it's called format and in other places it's called driver. Not sure what
the reason is behind this. In qemu they started to replace format with
driver:
http://git.qemu.org/?p=qemu.git;a=commit;h=74fe54f2a1b5c4f4498a8fe521e1dfc936656cd4

Keeping it similar to the disk schema would also let us re-use some XML
definitions that already exists.

> 
> > 
> > </disk>
> > 
> > This would solve all three issues: referencing, replication driver +
> > file driver in the same definition and a single target. One downside is
> > that the <replication> element is going to be more or less a <disk>
> > clone.
> > 
> > Another option would be to allow for nested <disk> elements and add a
> > bogus replication target.
> > 
> > <disk type='file' device='disk'>
> >     <driver name='qemu' type='raw'/>
> >     <source file='/opt/libvirt/1.raw'/>
> > 
> >     <target dev='replication' bus='virtio'/>
> > 
> >     <disk type='file' device='disk' mode='secondary'>
> >         <driver name='qemu' type='qcow2'/>
> >         <source file='/opt/libvirt/active_disk.qcow2'/>
> >         <target dev='vda' bus='virtio'/>
> >         <backingStore type='file'>
> >             <format type='qcow2'/>
> >             <source file='/opt/libvirt/hidden_disk.qcow2'/>
> >         </backingStore>
> >     </disk>
> > </disk>
> 
> No, this doesn't make any sense and looks hackish.
> 

Alright, let's discard this one. I prefere option 1 as well. Just
unfortunate that two so similar things need two different XML
definitions. In one way it's good to differentiate between the two, but
at the same time it's nice to not having to repeat a whole lot.

> > 
> > > Additionally, looking through the docs for block replication posted on
> > > qemu devel require to use unstable commands such as "x-blockdev-change"
> > > which, until it's stable is not an option in libvirt.
> > > 
> > 
> > I agree that it is still way to early to push this stuff upstream. But I
> > think it's a good thing to start the discussion and experiment with new
> > features before they are marked as stable and the x-tag is removed. That
> > way libvirt is keeping up to date and can also impact the current work
> > done in qemu by doing early testing.
> 
> In libvirt we don't have any prior approach to have experimental
> features, thus we can't really use anything experimental. By using it we
> commited to support it. If you want to add anything experimental we will
> need to formulate a process to add a feature that may vanish later on.
> 

Yea, I'm not proposing that anyone should start pushing experimental
stuff upstream. What I'm saying is that it could be benifitial to start
testing out the experimental features before they are added as stable,
for reasons stated earler.

> > 
> > > Also for all this to be useful, the rest of the COLO stuff needs to be
> > > merged too, otherwise the block replication code won't be used at all.
> > > 
> > 
> > From what I've gathered, block replication is supposed to be an
> > individual feature, separate from COLO. Even so, I see your point about
> > not including something that is not being used at the moment.
> 
> Okay. Fair enough. In that case though, how does that differ from a
> blockdev-mirror job?

Some of the qemu devs could probably give you a better answer to this.
But I assume the two serve different purposes. I don't think they would
spend time creating something that already existed.

>From what I can tell, mirroring/replicating writes from the primary to
the secondary guest is only one part of the block replication
functionality. It also takes care of buffering writes done by the
secondary until they synchronize etc.