[libvirt] [PATCH 2/8] Basic framework for lock manager plugins

Tue Nov 30 10:20:38 UTC 2010

On Wed, Nov 24, 2010 at 03:08:41PM -0500, David Teigland wrote:
> On Mon, Nov 22, 2010 at 06:09:21PM +0000, Daniel P. Berrange wrote:
> > +/*
> > + * Flags to pass to 'load_drv' and also 'new_drv' method
> > + * Plugins must support at least one of the modes. If a
> > + * mode is unsupported, it must return an error
> > + */
> > +enum {
> > +    VIR_LOCK_MANAGER_MODE_CONTENT    = (1 << 0),
> > +    VIR_LOCK_MANAGER_MODE_METADATA  = (1 << 1),
> > +} virLockManagerFlags;
> 
> These terms indicate what libvirt is using the locks for, but from a lock
> manager perspective, they don't have any meaning.  It's not clear what the
> lock manager should do differently for one type vs the other.
> 
> Is it the scope of the locks that differs?  i.e. locking between local
> processes on a single host, vs distributed locking between hosts?  If
> that's the case, then maybe name them LOCAL and DISTRIBUTED?

The 'CONTENT' lock is protecting the content of the disk, so the
scope must apply to any host able to access the disk.

The 'METADATA' lock is protecting the file attributes of the disk,
so the scope must apply to an host able to access the same inode.
Thus for block devices, or local filesystems (ext3, xfs, etc) the
scope is only 1 host. For network or cluster filesystems (nfs, gfs
etc) the scope would obviously be multi-host. An 'fcntl' based
lock gives a scope that matches this.

The more important difference between the CONTENT & METADATA locks
though, is the effect of the disk sharing modes. The CONTENT locks
will always take into account the sharing mode, while a METADATA
lock will always be exclusive.

> > +enum {
> > +    /* The resource to be locked is a virtual disk */
> > +    VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK = 0,
> > +} virLockManagerResourceType;
> 
> Generic is good, and there are parts where it could still be more like a
> generic, traditional lock manager API...
> 
> > +typedef enum {
> > +    /* The resource is assigned in readonly mode */
> > +    VIR_LOCK_MANAGER_RESOURCE_READONLY = (1 << 0),
> > +    /* The resource is assigned in shared, writable mode */
> > +    VIR_LOCK_MANAGER_RESOURCE_SHARED   = (1 << 1),
> > +} virLockManagerResourceFlags;
> 
> I really think you should adopt traditional lock modes instead:
>
> 
> LOCK_MODE_EXCLUSIVE (aka WRITE) - incompatible with other SH and EX locks.
> 
> LOCK_MODE_SHARED (aka READ) - compatible with SH locks, not with EX locks.
> 
> LOCK_MODE_NONE - the lock manager is unused, and it's up to the
> application to do its own locking or coordination when accessing the
> resource, e.g. if there's a clustered db or fs on the device that does its
> own coordination.  You seem to be calling this SHARED, but there's
> actually no lock or exclusion that a lock manager would implement for this
> kind of usage.
> 
> Each resource should have its own lock mode, and a mode should always be
> specified (rather than defaulting to EX).  libvirt would use the lock mode
> based on how the resource will be used, it would take a SH lock on a
> readonly resource.

NB the libvirt flags above are *not* expressing lock modes, rather they
are expressing the VM's disk access mode. I wanted to leave it upto the
plugin to decide how to translate this into a lock mode. The rules for
VM disk access mode can be more granular that what a EXCLUSIVE/SHARED
lock mode would allow

 - No flags  - exclusive read/write for the VM
 - SHARED    - shared read/write across any VMs
 - READONLY  - shared read across any VMs

In the future, it is likely that we'll allow grouping of VMs, so the
SHARED disk access mode would not neccessarily equate to a simple
SHARED lock mode. eg, one group of VMs could share disk, and this
disk would not be accessible to any other group of VMs. In other
words, instead of applying a SHARED lock directly on the resource,
there would be scope for a EXCLUSIVE lock on a (resource, vm group)
pair.

Now, whether we'll ever implement this in any lock managers is open
to debate, but I want to at least leave open the possibility in
the lock manager API. Thus disk access modes are preferrable to
directly passing lock modes, so that the lock manager plugin has
maximum flexibility in how it applies the locking.

> non-blocking and force options would probably be quite useful to have
> right from the beginning for testing.

NB all locks are non-blocking. If the lock for a resource is already held,
the plugin would return an error to the caller, and libvirt would abort
startup of the guest. We don't want to block startup indefinitely waiting
for a lock.

I've thought about whether we might want a way to break a lock, or
launch even if locks can't be obtained, but figure this can be added
later without any design impact.

> > +/**
> > + * virLockManagerNew:
> > + * @man: the lock manager context
> > + * @type: the type of process to be supervised
> > + * @flags: optional flags, currently unused
> > + *
> > + * Initialize a new context to supervise a process, usually
> > + * a virtual machine. If the lock driver requested a
> > + * private data, the <code>privateData</code> field of
> > + * <code>man</code> will be initialized with a suitable
> > + * pointer.
> > + *
> > + * Returns 0 if successful initialized a new context, -1 on error
> > + */
> > +typedef int (*virLockDriverNew)(virLockManagerPtr man,
> > +                                unsigned int type,
> > +                                unsigned int flags);
> 
> This is fundamentally about creating/defining a new process or new lock
> owner.  A name like NewOwner or NewPid or NewVM would make that more
> clear.  You refer to it sometimes as a "managed object".

Yep, NewObject would be suitable.

> 
> > +/**
> > + * virLockDriverAddResource:
> > + * @manager: the lock manager context
> > + * @type: the resource type virLockManagerResourceType
> > + * @name: the resource name
> > + * @flags: the resource access flags
> > + *
> > + * Assign a resource to a managed object. This will
> > + * only be called prior to the object is being locked
> > + * when it is inactive. eg, to set the initial  boot
> > + * time disk assignments on a VM
> > + * The format of @name varies according to
> > + * the resource @type. A VIR_LOCK_MANAGER_RESOURCE_TYPE_DISK
> > + * will have a fully qualified file path.
> > + *
> > + * If no flags are given, the resource is assumed to be
> > + * used in exclusive, read-write mode. Access can be
> > + * relaxed to readonly, or shared read-write.
> > + *
> > + * Returns 0 on success, or -1 on failure
> > + */
> > +typedef int (*virLockDriverAddResource)(virLockManagerPtr man,
> > +                                        unsigned int type,
> > +                                        const char *name,
> > +                                        unsigned int flags);
> 
> As mentioned above, I think we want a lock mode arg.
> 
> > +/**
> > + * virLockDriverAcquireObject:
> > + * @manager: the lock manager context
> > + * @state: the current lock state
> > + * @flags: optional flags, currently unused
> > + *
> > + * Start managing resources for the object. If the
> > + * object is being transferred from another location
> > + * the current lock state may be passed in. This
> > + * must be called from the PID that represents the
> > + * object to be managed. If the lock is lost at any
> > + * time, the PID will be killed off by the lock manager.
> > + *
> > + * Returns 0 on success, or -1 on failure
> > + */
> > +typedef int (*virLockDriverAcquireObject)(virLockManagerPtr man,
> > +                                          const char *state,
> > +                                          unsigned int flags);
> 
> AcquireResources?

It depends which way you're looking at it. I tend to think
'Object' implies all associated resources.

> 
> > +/**
> > + * virLockDriverAttachObject:
> > + * @manager: the lock manager context
> > + * @pid: the managed object PID
> > + * @flags: optional flags, currently unused
> > + *
> > + * Re-attach to an existing lock manager instance managing
> > + * PID @pid.
> > + *
> > + * Returns 0 on success, or -1 on failure
> > + */
> > +typedef int (*virLockDriverAttachObject)(virLockManagerPtr man,
> > +                                         pid_t pid,
> > +                                         unsigned int flags);
> > +
> > +/**
> > + * virLockDriverDetachObject:
> > + * @manager: the lock manager context
> > + * @pid: the managed object PID
> > + * @flags: optional flags, currently unused
> > + *
> > + * Deattach from an existing lock manager instance managing
> > + * PID @pid.
> > + *
> > + * Returns 0 on success, or -1 on failure
> > + */
> > +typedef int (*virLockDriverDetachObject)(virLockManagerPtr man,
> > +                                          pid_t pid,
> > +                                          unsigned int flags);
> 
> I guess these two are to allow libvirtd to be restarted while the lock
> manager remains running and holding locks?  Again, Pid/Owner/VM would be
> less obtuse than Object.

The primary AcquireObject call is done in the context of the QEMU
child process, so libvirtd never directly owns the lock.

The AttachObject/DetachObject calls are made by libvirtd, whenever
it is about todo something on behalf of the managed object holding
the lock. eg when libvirtd does disk hotplug it will do

    $man = NewObject()
    AttachObject($man, $QEMUPID);
    AcquireResource($man, $newdiskpath);
    DetachObject($man);
    FreeObject($man)

So basically the AttachObject call is telling the lock manager
plugin what $PID the following AcquireResource operation needs
to be applied to. Chances are your DetachObject call will only
need to be a no-op but I put the hook in just in case there's
a need for it

Regards,
Daniel