[libvirt] using sync_manager with libvirt

Wed Aug 11 19:37:12 UTC 2010

On Wed, Aug 11, 2010 at 05:59:55PM +0100, Daniel P. Berrange wrote:
> On Tue, Aug 10, 2010 at 12:44:06PM -0400, David Teigland wrote:
> > Hi,
> > 
> > We've been working on a program called sync_manager that implements
> > shared-storage-based leases to protect shared resources.  One way we'd like
> > to use it is to protect vm images that reside on shared storage,
> > i.e. preventing two vms on two hosts from using the same image at once.
> 
> There are two different but related problems here:
> 
>  - Preventing 2 different VMs using the same disk
>  - Preventing the same VM running on 2 hosts at once
> 
> The first requires that there is a lease per configured disk (since
> a guest can have multiple disks). The latter requires a lease per
> VM and can ignore specifics of what disks are configured.
> 
> IIUC, sync-manager is aiming for the latter.

The present integration effort is aiming for the latter.  sync_manager
itself aims to be agnostic about what it's managing.

> > It's functional, and the next big step is using it through libvirt.
> > 
> > sync_manager "wraps" a process, i.e. acquires a lease, forks&execs a
> > process, renews the lease while the process runs, and releases the lease
> > when the process exits.  While the process runs, it has exclusive access
> > to whatever resource was named in the lease that was acquired.
> 
> There are complications around migration we need to consider too.
> During migration, you actually need QEMU running on two hosts at
> once. IIRC the idea is that before starting the migration operation,
> we'd have to tell sync-manager to mark the lease as shared with a
> specific host. The destination QEMU would have to startup in shared
> mode, and upgrade this to an exclusive lock when migration completes,
> or quit when migration fails.

sync_manager leases can only be exclusive, so it's a matter of transferring
ownership of the exclusive lock from source host to destination host.  We
have not yet added lease transfer capabilities to sync_manager, but it
might look something like this:

S = source host, sm-S = sync_manager on S, ...
D = destination host, sm-D = sync_manager on D, ...

1. sm-S holds the lease, and is monitoring qemu
2. migration begins from S to D
3. libvirt-D runs sm-D: sync_manager -c qemu with the addition of a new
   sync_manager option --receive-lease
4. sm-D writes its hostid D to the lease area signaling sm-S that it wants
   to be the lease owner when S is done with it
5. sm-D begins monitoring the lease owner on disk (which is still S)
6. sm-D forks qemu-D
7. sm-S sees that D wants the lease
8. qemu-S exits with success
9. sm-S sees qemu-S exit with success
10. sm-S writes D as the lease owner into the lease area and exits
    (in the non-migration/transfer case, sm-S writes owner=LEASE_FREE)
11. sm-D (still monitoring the lease owner) sees that it has become the
    owner, and begins renewing the lease
12. qemu-D runs fully

I don't know enough (anything) about qemu migration yet to say if those
steps work correctly or safely.  One concern is that qemu-D should not
enter a state where it can write until we are certain that D has been
written as the lease's owner.
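
A rough sketch of the hand-off in the steps above, modeled in Python with an
in-memory object standing in for the on-disk lease area.  None of this is
existing sync_manager code: the LeaseArea fields, the function names, and the
choice to free the lease when qemu-S exits with failure are all assumptions
made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_FREE = None  # stands in for owner=LEASE_FREE


@dataclass
class LeaseArea:
    """In-memory stand-in for the on-disk lease area (hypothetical layout)."""
    owner: Optional[str] = LEASE_FREE
    next_owner: Optional[str] = None  # host that asked to receive the lease


def request_lease(lease: LeaseArea, host_id: str) -> None:
    """Step 4: the destination records that it wants the lease."""
    if lease.next_owner is not None:
        raise RuntimeError("another host is already waiting for this lease")
    lease.next_owner = host_id


def release_or_transfer(lease: LeaseArea, host_id: str, qemu_ok: bool) -> None:
    """Step 10: when qemu exits, the owner hands the lease over or frees it."""
    if lease.owner != host_id:
        raise RuntimeError("only the current owner may give up the lease")
    if qemu_ok and lease.next_owner is not None:
        lease.owner = lease.next_owner
    else:
        # Assumption: a failed qemu exit simply frees the lease.
        lease.owner = LEASE_FREE
    lease.next_owner = None


def may_write(lease: LeaseArea, host_id: str) -> bool:
    """qemu on host_id must not write until host_id is the recorded owner."""
    return lease.owner == host_id


# Walk through the migration case: S owns the lease, D asks for it.
lease = LeaseArea(owner="S")
request_lease(lease, "D")
d_can_write_early = may_write(lease, "D")   # still owned by S
release_or_transfer(lease, "S", qemu_ok=True)
d_can_write_late = may_write(lease, "D")    # D is now the recorded owner
```

The may_write() check is where the concern above shows up: qemu-D would have
to be held back until that check passes on D.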

> > sync_manager daemon -i <host_id> -n <vm_id> -l <lease> -c <command> <args>

> >   <lease>	defines the shared storage area that sync_manager should
> > 		use for performing the disk-paxos based synchronization.
> > 		It consists of <resource_name>:<path>:<offset>, where
> > 		<resource_name> is likely to be the vm name/uuid (or the
> > 		name of the vm's disk image), and <path>:<offset> is an
> > 		area of shared storage that has been allocated for
> > 		sync_manager to use (a separate area for each resource/vm).
> 
> Can you give some real examples of the lease arg ?  I guess <path> must
> exclude the ':' character, or have some defined escaping scheme.

-l vm0:/dev/vg/lease_area:0

(exclude : from paths)
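
As an illustration of that format, here is a minimal Python sketch that splits
a lease argument into its three fields.  The Lease type and parse_lease() are
hypothetical names, not part of sync_manager, and the sketch assumes the
"no : in paths" rule rather than any escaping scheme.

```python
from typing import NamedTuple


class Lease(NamedTuple):
    resource_name: str
    path: str
    offset: int


def parse_lease(arg: str) -> Lease:
    """Split <resource_name>:<path>:<offset>; ':' is not allowed in the path."""
    parts = arg.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected <resource_name>:<path>:<offset>, got {arg!r}")
    name, path, offset_text = parts
    if not name or not path:
        raise ValueError("resource name and path must be non-empty")
    offset = int(offset_text)  # raises ValueError if not a number
    if offset < 0:
        raise ValueError("offset must not be negative")
    return Lease(name, path, offset)


example = parse_lease("vm0:/dev/vg/lease_area:0")
```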

Manually setting up, initializing and keeping track of lease areas would be
a pain, so we'll definitely be looking at adding that to higher level tools.

> The <host_id> obviously needs to be in /etc/libvirt/sync-manager.conf
> since that's a per-host config. I assume the shared storage area is per
> host too ?
> 
> That leaves just the VM name/uuid as a per-VM config option, and we 
> obviously already have that in XML.  Is there actually any extra
> attribute we need to track per-guest in the XML ? If not this will
> simplify life, because we won't have to track sync-manager specific
> attributes

With the plugin style hooks you describe below, it seems all the
sync_manager config could be kept separate from the libvirt config.

> In terms of integration with libvirt, I think it is desirable that we keep
> libvirt and sync-manager loosely coupled. ie We don't want to hardcode
> libvirt using sync-manager, nor do we want to hardcode sync-manager only
> working with libvirt.
> 
> This says to me that we need to provide a well defined plugin system for
> providing a 'supervisor process' for QEMU guests. Essentially a dlopen()
> module that provides a handful (< 10) callbacks which are triggered in
> appropriate codepaths. At minimum I expect we need
> 
>  - A callback at ARGV building, to let extra sync-manager ARGV be injected
>  - A callback at VM startup. Not needed for sync-manager, but to allow for
>    alternate impls that aren't based around supervising.
>  - A callback at VM shutdown. Just to cleanup resources
>  - A callback in the VM destroy method, in case we need to do something
>    other than just kill($PID) the QEMU $PID. (eg to perhaps
>    tell sync-manager to kill QEMU instead of killing it ourselves)
>  - Several callbacks at various stages of migration to deal with
>    lock downgrade/upgrade

Sounds good.
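
To make the shape of that hook set concrete, here is a small sketch, written
in Python for brevity even though the real thing would be a dlopen() module.
The class and method names are hypothetical, and the migration callbacks are
left out because their semantics are not settled.  Only the sync_manager
command line follows the "daemon -i ... -n ... -l ... -c ..." form quoted
earlier.

```python
from typing import Dict, List


class SupervisorPlugin:
    """Hypothetical callback set; the defaults leave QEMU handling unchanged."""

    def build_argv(self, vm_name: str, qemu_argv: List[str]) -> List[str]:
        """Called at ARGV building; may wrap or extend the QEMU command."""
        return list(qemu_argv)

    def vm_startup(self, vm_name: str) -> None:
        """Called at VM startup; unused by a wrapper-style supervisor."""

    def vm_shutdown(self, vm_name: str) -> None:
        """Called at VM shutdown, to clean up resources."""

    def vm_destroy(self, vm_name: str, pid: int) -> bool:
        """Return True if the plugin killed QEMU itself, False to let the
        caller fall back to a plain kill of pid."""
        return False


class SyncManagerPlugin(SupervisorPlugin):
    """Wraps QEMU in sync_manager; config lives outside the libvirt XML."""

    def __init__(self, host_id: str, leases: Dict[str, str]) -> None:
        self.host_id = host_id  # per-host setting
        self.leases = leases    # vm name -> <resource_name>:<path>:<offset>

    def build_argv(self, vm_name: str, qemu_argv: List[str]) -> List[str]:
        return [
            "sync_manager", "daemon",
            "-i", self.host_id,
            "-n", vm_name,
            "-l", self.leases[vm_name],
            "-c", *qemu_argv,
        ]


plugin = SyncManagerPlugin("1", {"vm0": "vm0:/dev/vg/lease_area:0"})
argv = plugin.build_argv("vm0", ["/usr/bin/qemu", "-m", "512"])
```

Keeping the lease table inside the plugin is one way the sync_manager config
could stay separate from the per-guest XML.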

> The one further complication is with the security drivers. IIUC, we will
> absolutely not want QEMU to have any access to the shared storage lease
> area. The problem is that if we just inject the wrapper process as is,
> sync-manager will end up running with exact same privileges as QEMU.
> ie same UID:GID, and same selinux context. I'm really not at all sure
> how to deal with this problem, because our core design is that the thing
> we spawn inherits the privileges we setup at fork() time. We don't want
> to delegate the security setup to sync-manager, because it introduces
> a huge variable condition in the security system. We need guaranteed
> consistent security setup for QEMU, regardless of supervisor process
> in use.

It might not be a big problem for qemu to write to its own lease area,
but writing to another's probably would (e.g. at a different offset on the
same lv).  That implies a separate lease lv per qemu; I'll have to find
out how close that gets to lvm scalability limits.

Dave