[libvirt] using sync_manager with libvirt
David Teigland
teigland at redhat.com
Wed Aug 11 19:37:12 UTC 2010
On Wed, Aug 11, 2010 at 05:59:55PM +0100, Daniel P. Berrange wrote:
> On Tue, Aug 10, 2010 at 12:44:06PM -0400, David Teigland wrote:
> > Hi,
> >
> > We've been working on a program called sync_manager that implements
> > shared-storage-based leases to protect shared resources. One way we'd like
> > to use it is to protect vm images that reside on shared storage,
> > i.e. preventing two vm's on two hosts from using the same image at once.
>
> There are two different but related problems here:
>
> - Preventing 2 different VMs using the same disk
> - Preventing the same VM running on 2 hosts at once
>
> The first requires that there is a lease per configured disk (since
> a guest can have multiple disks). The latter requires a lease per
> VM and can ignore specifics of what disks are configured.
>
> IIUC, sync-manager is aiming for the latter.

The present integration effort is aiming for the latter. sync_manager
itself aims to be agnostic about what it's managing.

> > It's functional, and the next big step is using it through libvirt.
> >
> > sync_manager "wraps" a process, i.e. acquires a lease, forks&execs a
> > process, renews the lease while the process runs, and releases the lease
> > when the process exits. While the process runs, it has exclusive access
> > to whatever resource was named in the lease that was acquired.
>
> There are complications around migration we need to consider too.
> During migration, you actually need QEMU running on two hosts at
> once. IIRC the idea is that before starting the migration operation,
> we'd have to tell sync-manager to mark the lease as shared with a
> specific host. The destination QEMU would have to startup in shared
> mode, and upgrade this to an exclusive lock when migration completes,
> or quit when migration fails.

sync_manager leases can only be exclusive, so it's a matter of transferring
ownership of the exclusive lock from source host to destination host. We
have not yet added lease transfer capabilities to sync_manager, but it
might look something like this:
S = source host, sm-S = sync_manager on S, ...
D = destination host, sm-D = sync_manager on D, ...

1. sm-S holds the lease, and is monitoring qemu
2. migration begins from S to D
3. libvirt-D runs sm-D: sync_manager -c qemu with the addition of a new
   sync_manager option --receive-lease
4. sm-D writes its hostid D to the lease area signaling sm-S that it wants
   to be the lease owner when S is done with it
5. sm-D begins monitoring the lease owner on disk (which is still S)
6. sm-D forks qemu-D
7. sm-S sees that D wants the lease
8. qemu-S exits with success
9. sm-S sees qemu-S exit with success
10. sm-S writes D as the lease owner into the lease area and exits
    (in the non-migration/transfer case, sm-S writes owner=LEASE_FREE)
11. sm-D (still monitoring the lease owner) sees that it has become the
    owner, and begins renewing the lease
12. qemu-D runs fully

I don't know enough (anything) about qemu migration yet to say if those
steps work correctly or safely. One concern is that qemu-D should not
enter a state where it can write until we are certain that D has been
written as the lease's owner.
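
To make the ownership hand-off in steps 4, 10 and 11 concrete, here is a
rough Python sketch. It is only a toy model: the lease area is an in-memory
object rather than shared storage, and the names LeaseArea, next_owner,
request_transfer, release and may_write are invented for illustration.
Nothing like this exists in sync_manager yet.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_FREE = None  # stand-in for the owner=LEASE_FREE value from step 10


@dataclass
class LeaseArea:
    """Toy in-memory stand-in for the lease area (hypothetical layout)."""
    owner: Optional[str] = LEASE_FREE
    next_owner: Optional[str] = None  # host that asked to receive the lease


def request_transfer(lease: LeaseArea, dest: str) -> None:
    """Step 4: the destination records that it wants the lease next."""
    if lease.owner is LEASE_FREE:
        raise RuntimeError("no current owner to transfer from")
    lease.next_owner = dest


def release(lease: LeaseArea, host: str) -> None:
    """Step 10: after its qemu exits with success, the owner hands the
    lease to the waiting host, or frees it if nobody asked for it."""
    if lease.owner != host:
        raise RuntimeError(f"{host} does not own the lease")
    lease.owner = lease.next_owner if lease.next_owner else LEASE_FREE
    lease.next_owner = None


def may_write(lease: LeaseArea, host: str) -> bool:
    """Safety condition: qemu on `host` may write only once `host` is owner."""
    return lease.owner == host


# Walk through the migration case: S owns the lease, D asks for it.
lease = LeaseArea(owner="S")
request_transfer(lease, "D")      # step 4
d_can_write_early = may_write(lease, "D")   # False: S is still the owner
release(lease, "S")               # step 10
d_can_write_late = may_write(lease, "D")    # True: D is now the owner
```

The point of may_write is the concern above: qemu-D has to stay in a
non-writing state for as long as that check is false.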
> > sync_manager daemon -i <host_id> -n <vm_id> -l <lease> -c <command> <args>
> > <lease> defines the shared storage area that sync_manager should
> > use for performing the disk-paxos based synchronization.
> > It consists of <resource_name>:<path>:<offset>, where
> > <resource_name> is likely to be the vm name/uuid (or the
> > name of the vm's disk image), and <path>:<offset> is an
> > area of shared storage that has been allocated for
> > sync_manager to use (a separate area for each resource/vm).
>
> Can you give some real examples of the lease arg ? I guess <path> must
> exclude the ':' character, or have some defined escaping scheme.
-l vm0:/dev/vg/lease_area:0
(exclude : from paths)
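
As an illustration of the <resource_name>:<path>:<offset> format, a small
Python sketch of how a higher level tool might split the argument. The
names Lease and parse_lease are made up here, and treating the offset as a
non-negative integer is an assumption. Because paths must not contain ':',
a plain split on ':' is enough.

```python
from typing import NamedTuple


class Lease(NamedTuple):
    resource_name: str
    path: str
    offset: int


def parse_lease(arg: str) -> Lease:
    """Split '<resource_name>:<path>:<offset>' into its three fields."""
    parts = arg.split(":")
    if len(parts) != 3:
        # Either a field is missing or the path contains ':'.
        raise ValueError(f"expected name:path:offset, got {arg!r}")
    name, path, offset_text = parts
    if not name or not path:
        raise ValueError(f"empty resource name or path in {arg!r}")
    offset = int(offset_text)  # raises ValueError if not a number
    if offset < 0:
        raise ValueError(f"negative offset in {arg!r}")
    return Lease(name, path, offset)


example = parse_lease("vm0:/dev/vg/lease_area:0")
```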
Manually setting up, initializing and keeping track of lease areas would be
a pain, so we'll definitely be looking at adding that to higher level tools.

> The <host_id> obviously needs to be in /etc/libvirt/sync-manager.conf
> since that's a per-host config. I assume the shared storage area is per
> host too ?
>
> That leaves just the VM name/uuid as a per-VM config option, and we
> obviously already have that in XML. Is there actually any extra
> attribute we need to track per-guest in the XML ? If not this will
> simplify life, because we won't have to track sync-manager specific
> attributes

With the plugin style hooks you describe below, it seems all the
sync_manager config could be kept separate from the libvirt config.

> In terms of integration with libvirt, I think it is desirable that we keep
> libvirt and sync-manager loosely coupled. ie We don't want to hardcode
> libvirt using sync-manager, nor do we want to hardcode sync-manager only
> working with libvirt.
>
> This says to me that we need to provide a well defined plugin system for
> providing a 'supervisor process' for QEMU guests. Essentially a dlopen()
> module that provides a handful (< 10) callbacks which are triggered in
> appropriate codepaths. At minimum I expect we need
>
> - A callback at ARGV building, to let extra sync-manager ARGV to be injected
> - A callback at VM startup. Not needed for sync-manager, but to allow for
> alternate impls that aren't based around supervising.
> - A callback at VM shutdown. Just to clean up resources
> - A callback in the VM destroy method, in case we need to do something
> different other than just kill($PID) the QEMU $PID. (eg to perhaps
> tell sync-manager to kill QEMU instead of killing it ourselves)
> - Several callbacks at various stages of migration to deal with
> lock downgrade/upgrade
sounds good
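
For discussion, a rough Python sketch of what that callback set could look
like. The real thing would presumably be a dlopen() module in C; every
class and method name here is hypothetical, and only the ARGV callback is
filled in. The injected arguments follow the "sync_manager daemon -i
<host_id> -n <vm_id> -l <lease> -c <command> <args>" form quoted earlier.

```python
from typing import Callable, List


class Supervisor:
    """Hypothetical callback set; the defaults do nothing."""

    def build_argv(self, vm_id: str, qemu_argv: List[str]) -> List[str]:
        return list(qemu_argv)

    def on_startup(self, vm_id: str) -> None:
        pass

    def on_shutdown(self, vm_id: str) -> None:
        pass

    def destroy(self, vm_id: str, pid: int) -> bool:
        """Return True if the plugin killed qemu itself; False means the
        caller should fall back to kill($PID)."""
        return False

    def migrate_begin(self, vm_id: str, dest_host: str) -> None:
        pass

    def migrate_finish(self, vm_id: str, success: bool) -> None:
        pass


class SyncManagerSupervisor(Supervisor):
    def __init__(self, host_id: str, lease_for: Callable[[str], str]):
        self.host_id = host_id      # per-host setting
        self.lease_for = lease_for  # maps a vm id to its lease argument

    def build_argv(self, vm_id: str, qemu_argv: List[str]) -> List[str]:
        # Prefix the qemu command line with the sync_manager wrapper.
        return ["sync_manager", "daemon",
                "-i", self.host_id,
                "-n", vm_id,
                "-l", self.lease_for(vm_id),
                "-c", *qemu_argv]


plugin = SyncManagerSupervisor("1", lambda vm: f"{vm}:/dev/vg/lease_area:0")
wrapped = plugin.build_argv("vm0", ["qemu", "-m", "512"])
```

The lease_for lookup is where the sync_manager config would stay separate
from the libvirt config.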
> The one further complication is with the security drivers. IIUC, we will
> absolutely not want QEMU to have any access to the shared storage lease
> area. The problem is that if we just inject the wrapper process as is,
> sync-manager will end up running with exact same privileges as QEMU.
> ie same UID:GID, and same selinux context. I'm really not at all sure
> how to deal with this problem, because our core design is that the thing
> we spawn inherits the privileges we setup at fork() time. We don't want
> to delegate the security setup to sync-manager, because it introduces
> a huge variable condition in the security system. We need guaranteed
> consistent security setup for QEMU, regardless of supervisor process
> in use.

It might not be a big problem for qemu to write to its own lease area,
but writing to another's probably would (e.g. at a different offset on the
same lv). That implies a separate lease lv per qemu; I'll have to find
out how close that gets to lvm scalability limits.

Dave