[libvirt] [Qemu-devel] live snapshot wiki updated

Stefan Hajnoczi stefanha at gmail.com
Fri Jul 22 05:06:54 UTC 2011


On Thu, Jul 21, 2011 at 8:42 PM, Blue Swirl <blauwirbel at gmail.com> wrote:
> On Thu, Jul 21, 2011 at 6:01 PM, Stefan Hajnoczi <stefanha at gmail.com> wrote:
>> On Thu, Jul 21, 2011 at 3:02 PM, Eric Blake <eblake at redhat.com> wrote:
>>> Thank you for persisting - you've found another hole that needs to be
>>> plugged.  It sounds like you are proposing that after a qemu process dies,
>>> libvirt re-reads the qcow2 metadata headers and validates that the
>>> backing file information has not changed in a manner unexpected by libvirt.
>>>  If it has, then the qemu process that just died was compromised to the
>>> point that restarting a new qemu process from the old image is now a
>>> security risk.  So this is _yet another_ security aspect that needs to be
>>> coded into libvirt as part of hardening sVirt.
>>
>> The backing file information changes when image streaming completes.
>>
>> Before: fedora.img <- my_vm.qed
>> After: my_vm.qed (fedora.img is no longer referenced)
>>
>> The image streaming operation copies data out of fedora.img and
>> populates my_vm.qed.  When image streaming completes, the backing file
>> is no longer needed and my_vm.qed is updated to drop the backing file.
>>
>> I think we need to design carefully to prevent QEMU and libvirt from
>> making incorrect assumptions about who does what.  I really wish that all
>> this image file business were outside QEMU and libvirt - that we had a
>> separate storage management service which handled the details.  QEMU
>> would only do block device operations (no image format manipulation),
>> and libvirt would only delegate to the storage management service.
>> Today we seem to be sprinkling a little bit of storage management into
>> QEMU and a little bit into libvirt :(.
>>
>> In that spirit it is much nicer to think of storage as a SAN
>> appliance where you have LUNs that you access as block devices.  The
>> appliance also provides an API for snapshotting, cloning LUNs, etc.
>>
>> Let's move to that model instead of worrying about how to spread
>> storage logic across QEMU and libvirt.
>
> Would the NBD protocol fit this purpose, or is it too simple? Then
> libvirt would handle the storage format completely and present an NBD
> interface to QEMU (or give an fd to an external service) and QEMU
> would not care about the storage format in this mode at all.

NBD does not support flush (fdatasync).  Therefore only the slow
cache=writethrough mode can be used safely with it.
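
To make that concrete, here is a minimal sketch (plain C, hypothetical
helper names) of the two completion policies a block server could offer:

#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* cache=writethrough: every write reaches stable storage before it is
 * acknowledged, so no flush command is needed - but each request pays
 * the sync cost. */
static ssize_t write_through(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t ret = pwrite(fd, buf, len, off);
    if (ret < 0)
        return ret;
    if (fdatasync(fd) < 0)
        return -1;
    return ret;
}

/* cache=writeback: writes are acknowledged immediately and durability
 * is deferred until the guest issues a flush.  The protocol must be
 * able to carry that flush request - NBD cannot, which is why only
 * writethrough is safe over it. */
static int handle_guest_flush(int fd)
{
    return fdatasync(fd);
}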

It would be neat to use virtio-blk as the interface because it can be
passed through to the guest.  The guest would then talk directly to the
storage management service without going through QEMU.  The trick is to
do something like vhost (sketched below):
1. An ioeventfd for virtqueue kicks (guest->host)
2. An irqfd for interrupt injection (host->guest)
3. Shared memory for the vring and zero-copy data access
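
A rough sketch of that fd setup, assuming the KVM VM fd, the device's
notify address, and its GSI are already known (the notify register
layout and values here are assumptions, not a worked-out design):

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int setup_vhost_like_fds(int vm_fd, uint64_t notify_addr,
                                uint32_t gsi, int *kick_fd, int *call_fd)
{
    /* 1. ioeventfd: a guest write to the virtqueue notify register
     *    signals kick_fd directly instead of exiting to userspace. */
    *kick_fd = eventfd(0, EFD_CLOEXEC);
    struct kvm_ioeventfd kick = {
        .addr  = notify_addr,  /* virtio-blk notify register (assumed PIO) */
        .len   = 2,
        .fd    = *kick_fd,
        .flags = KVM_IOEVENTFD_FLAG_PIO,
    };
    if (ioctl(vm_fd, KVM_IOEVENTFD, &kick) < 0)
        return -1;

    /* 2. irqfd: the storage service signals call_fd to inject the
     *    device's interrupt into the guest without involving QEMU. */
    *call_fd = eventfd(0, EFD_CLOEXEC);
    struct kvm_irqfd call = {
        .fd  = *call_fd,
        .gsi = gsi,
    };
    if (ioctl(vm_fd, KVM_IRQFD, &call) < 0)
        return -1;

    /* 3. The vring and data buffers live in guest RAM, which the
     *    storage service maps as shared memory for zero-copy access. */
    return 0;
}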

The storage management service provides a UNIX domain socket over
which fds can be passed to set up the vhost-like virtio-blk interface.
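
Passing those fds over the service's socket would use the standard
SCM_RIGHTS ancillary-data mechanism, roughly:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one fd over a connected UNIX domain socket. */
static int send_fd(int sock, int fd_to_pass)
{
    char byte = 0;   /* at least one byte of real data must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;            /* ensures correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } u = { 0 };
    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;       /* kernel dups the fd for the peer */
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}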

Moving the image format code into a separate program makes it possible
to write safely to a backing file while VMs are using it, because the
storage service can be host-wide rather than per-VM.  For example, it
could stream a shared backing file over NFS while VMs run from
copy-on-write images on top of it.  This approach is also nice if we
ever want to do deduplication or other global operations.

To summarize:
The storage service manages image files including creation, deletion,
snapshotting, and actual I/O.  QEMU uses a vhost-like virtio-blk
interface and can pass it directly into the guest.  libvirt uses the
storage service API without needing to parse image files or keep track
of backing file relationships.
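
As a strawman, the service API that libvirt would consume might look
something like this (all names hypothetical; this is only to illustrate
the division of labor, not a concrete proposal):

#include <stdint.h>

typedef struct StorageVolume StorageVolume;  /* opaque handle */

/* Image lifecycle - libvirt calls these instead of running qemu-img
 * or parsing image headers itself. */
StorageVolume *storage_volume_create(const char *name, uint64_t size,
                                     const char *format /* e.g. "qed" */);
StorageVolume *storage_volume_snapshot(StorageVolume *base,
                                       const char *snap_name);
int storage_volume_delete(StorageVolume *vol);

/* Background operations such as image streaming happen inside the
 * service; callers just observe progress. */
int storage_volume_stream(StorageVolume *vol);

/* Attach a volume: returns a UNIX domain socket over which the
 * vhost-like virtio-blk fds are passed to QEMU. */
int storage_volume_attach(StorageVolume *vol);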

Stefan



