[Linux-cluster] [RFC] Generic Kernel API

Patrick Caulfield pcaulfie at redhat.com
Mon Sep 20 12:25:51 UTC 2004

At the cluster summit most people seemed to agree that we needed a generic,
pluggable kernel API for cluster functions. Well, I've finally got round to
doing something.

The attached spec allows for plug-in cluster modules with the possibility of
a node being a member of multiple clusters if the cluster managers allow it.
I've separated out the functions of "cluster manager" so they can be provided by
different components if necessary.

Two things that are not complete (or even started) in here are a communications
API and a locking API. 

For the first, I'd like to leave that to those more qualified than me to do, and
for the second I'd like to (less modestly) propose our existing DLM API, with the
argument that it is a full-featured API that others can implement parts of if
they wish.
Comments please.

-------------- next part --------------

The kernel holds a list of named cluster management modules which register themselves at
insmod time. Each of these may provide one or more groups of services: "comms", "membership" and "quorum".

In theory a node may be a member of many clusters, though some cluster managers may prevent this.

The kernel APIs presented here are meant to be simple enough to be tidy, but featureful enough
to implement SAF on top in userspace. I don't think it is appropriate to implement the full
SAF specification in kernel space.

Membership ops

struct membership_node_address {
	int32_t mna_len;
	char    mna_address[MAX_ADDR_LEN];
};

struct membership_node {
	int32_t				mn_nodeid;
	struct membership_node_address	mn_address;
	char				mn_name[MAX_NAME_LEN];
	uint32_t			mn_member;
	struct timeval			mn_boottime;
};

struct membership_notify_info {
	void *		mni_context;
	uint32_t	mni_viewnumber;
	uint32_t	mni_numitems;
	uint32_t	mni_nummembers;
	char *		mni_buffer;
};

struct membership_ops {
	int (*start_notify) (void *cmprivate,
			    void *context, uint32_t flags, membership_callback_routine *callback, char *buffer, int max_items);
#define	MEMBERSHIP_FLAGS_NOTIFY_CHANGES  1 /* Notify of membership changes */
#define	MEMBERSHIP_FLAGS_NOTIFY_NODES    2 /* Send me a full node list now */

	int (*notify_stop)  (void *cmprivate);
	int (*get_name)     (void *cmprivate, char *name, int maxlen);
	int (*get_node)     (void *cmprivate, int32_t nodeid, struct membership_node *node);
#define MEMBERSHIP_NODE_THISNODE        -1 /* Get info about local node */
};


/* This is what is called by membership services as a callback */
typedef int (membership_callback_routine) (void *context, uint32_t reason);

I've made node IDs a signed int32; this allows for a negative pseudo-ID for "this node".
cman uses 0 for "this node", but other membership APIs may allow a real node to have an ID of zero.
SAF uses a "this node" pseudo ID.
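To illustrate the intended flow, here is a minimal userspace sketch (not kernel code) of how a CM might implement start_notify and deliver a change through the callback. The toy_* names, the single-listener state and the demo self-check are my own inventions for the example; a real CM would keep a list of listeners and also deliver the node list into the caller's buffer.

```c
#include <stdint.h>
#include <stddef.h>

/* Callback type from the spec: invoked by the membership service. */
typedef int (membership_callback_routine)(void *context, uint32_t reason);

#define MEMBERSHIP_FLAGS_NOTIFY_CHANGES  1

/* Toy CM state: a single registered listener (a real CM would keep a list). */
struct toy_cm {
	void *context;
	membership_callback_routine *callback;
	uint32_t flags;
};

/* start_notify as a CM might implement it (node-list delivery omitted). */
int toy_start_notify(void *cmprivate, void *context, uint32_t flags,
		     membership_callback_routine *callback,
		     char *buffer, int max_items)
{
	struct toy_cm *cm = cmprivate;
	(void)buffer;
	(void)max_items;
	cm->context = context;
	cm->callback = callback;
	cm->flags = flags;
	return 0;
}

/* Called from the CM's internals whenever the member list changes. */
int toy_membership_changed(struct toy_cm *cm, uint32_t reason)
{
	if (cm->callback && (cm->flags & MEMBERSHIP_FLAGS_NOTIFY_CHANGES))
		return cm->callback(cm->context, reason);
	return 0;
}

/* Self-check: wire up a recording callback and fire one change. */
static uint32_t seen_reason;

static int demo_cb(void *context, uint32_t reason)
{
	(void)context;
	seen_reason = reason;
	return 0;
}

int toy_demo(void)
{
	struct toy_cm cm = { NULL, NULL, 0 };
	toy_start_notify(&cm, NULL, MEMBERSHIP_FLAGS_NOTIFY_CHANGES,
			 demo_cb, NULL, 0);
	toy_membership_changed(&cm, 7);
	return (int)seen_reason;
}
```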

Quorum ops

/* These might be a bit too specific... */

struct quorum_info {
	uint32_t qi_total_votes;
	uint32_t qi_expected_votes;
	uint32_t qi_quorum;
};

struct quorum_ops {
	int (*get_quorate) (void *cmprivate);
	int (*get_votes)   (void *cmprivate, int32_t nodeid);
	int (*get_info)    (void *cmprivate, struct quorum_info *info);
};
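As a sketch of how these ops fit together: get_quorate() would presumably compare the current vote count against the quorum threshold. The toy_* helpers below are illustrative only, and the simple-majority formula is just one common choice, not something this spec mandates.

```c
#include <stdint.h>

struct quorum_info {
	uint32_t qi_total_votes;	/* votes held by current members */
	uint32_t qi_expected_votes;	/* votes when every node is present */
	uint32_t qi_quorum;		/* votes needed to be quorate */
};

/* One plausible get_quorate(): quorate when the current votes reach
 * the threshold the CM has computed. */
int toy_quorate(const struct quorum_info *qi)
{
	return qi->qi_total_votes >= qi->qi_quorum;
}

/* A simple-majority threshold; real CMs may compute qi_quorum differently. */
uint32_t toy_majority_quorum(uint32_t expected_votes)
{
	return expected_votes / 2 + 1;
}
```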

Bottom interface. 

/* When a CM module is loaded it calls cm_register()
 * which adds its proto_name/ops pair to a global list. */

int cm_register(struct cm_ops *proto);
void cm_unregister(struct cm_ops *proto);

/* A CM sets up one of these structs with the functions it can provide and
 * registers it, along with its name (type) using cm_register() */

struct cm_ops {
        char co_proto_name[256];

        /* These are required */

        int (*co_attach) (struct cm_info *info);
        int (*co_detach) (void *cmprivate);

        /* These are optional, a CM may provide some or all */

        struct cm_comm_ops    *co_cops;
        struct membership_ops *co_mops;
        struct quorum_ops     *co_qops;
};

I've omitted the comms interface because I'm not really sure how full-featured it
really ought to be.

We may want to add a locking interface in here too?

Top interface  

/* When cm_attach() is called, the "harness" searches the
 * global list of registered CM's, looking for one with the given
 * proto_name.  If one is found, its co_attach() function is called, being
 * passed the cm_attach() parameters. */

int cm_attach(char *proto_name, char *cluster_name, struct cm_info *info);
void cm_detach(void *cmprivate);

/* When a CM's attach function is called, it fills in the cm_info struct
 * provided by the caller with its own ops functions and values.  This
 * includes its private data pointer to be used with its ops functions. */

struct cm_info {
        struct cm_ops *ops;
        void *cmprivate;
};
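Putting the two interfaces together, the harness could be sketched in userspace like this. The list handling, error values and the trivial "foo" module are all stand-ins of my own; a kernel implementation would use its own list primitives and locking.

```c
#include <string.h>

struct cm_info;

/* Per-module ops table, as in the spec (optional ops groups omitted). */
struct cm_ops {
	char co_proto_name[256];
	int (*co_attach)(struct cm_info *info);
	int (*co_detach)(void *cmprivate);
	struct cm_ops *next;		/* toy stand-in for the kernel's list */
};

struct cm_info {
	struct cm_ops *ops;
	void *cmprivate;
};

static struct cm_ops *cm_list;		/* global list of registered CMs */

int cm_register(struct cm_ops *proto)
{
	proto->next = cm_list;
	cm_list = proto;
	return 0;
}

/* Search the registered CMs for proto_name and hand off to its attach. */
int cm_attach(const char *proto_name, const char *cluster_name,
	      struct cm_info *info)
{
	struct cm_ops *p;
	(void)cluster_name;		/* a real CM would join/select the cluster */
	for (p = cm_list; p; p = p->next)
		if (strcmp(p->co_proto_name, proto_name) == 0)
			return p->co_attach(info);
	return -1;			/* no such protocol registered */
}

/* A trivial "foo" module. */
struct cm_ops foo_ops;			/* tentative definition, filled below */

static int foo_state = 42;		/* stands in for foo's private data */

static int foo_attach(struct cm_info *info)
{
	info->ops = &foo_ops;		/* fill in caller's cm_info, per the spec */
	info->cmprivate = &foo_state;
	return 0;
}

static int foo_detach(void *cmprivate)
{
	(void)cmprivate;
	return 0;
}

struct cm_ops foo_ops = { "foo", foo_attach, foo_detach, NULL };

/* Register foo, attach to it, and return its private value. */
int harness_demo(void)
{
	struct cm_info info;
	cm_register(&foo_ops);
	if (cm_attach("foo", "mycluster", &info) != 0)
		return -1;
	return *(int *)info.cmprivate;
}
```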


Say "foo" is a "low level" system and provides comms and membership ops.

1. it sets foo_ops
        co_proto_name = "foo";
        co_attach = foo_attach;
        co_detach = foo_detach;
        co_cops = foo_cops;
        co_mops = foo_mops;
        co_qops = NULL;
2. and calls cm_register(&foo_ops);

Say "bar" is a higher level system and provides membership and quorum ops.

1. it sets bar_ops
        co_proto_name = "bar";
        co_attach = bar_attach;
        co_detach = bar_detach;
        co_cops = NULL;
        co_mops = bar_mops;
        co_qops = bar_qops;

2. and calls cm_register(&bar_ops);

Internally, bar could attach to foo and use the functions foo provides.
Bar may provide some member_ops functions that foo doesn't, in addition to
some quorum services, none of which foo provides.  Applications may attach
to just bar, just foo, or in some cases both foo and bar.

bar could be programmed to use foo statically (like lock_dlm is
programmed to use dlm and cman, but gfs can use either lock_dlm or
lock_gulm).  bar could also take the lower level type (foo) as an input
parameter in some way, making it dynamic.
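The dynamic variant might look like the sketch below, where bar receives the lower-level proto name as a parameter and attaches through the harness. Here cm_attach() is stubbed to a fixed answer, and bar_attach_on() is a hypothetical entry point of my own, purely for illustration.

```c
#include <string.h>

struct cm_ops;

struct cm_info {
	struct cm_ops *ops;
	void *cmprivate;
};

/* Stub harness: pretend only "foo" is registered. */
int cm_attach(const char *proto_name, const char *cluster_name,
	      struct cm_info *info)
{
	(void)cluster_name;
	if (strcmp(proto_name, "foo") != 0)
		return -1;
	info->ops = NULL;		/* a real harness fills these in */
	info->cmprivate = "foo-private";
	return 0;
}

/* bar's private state records which lower-level CM it is stacked on. */
struct bar_private {
	struct cm_info lower;		/* handle on the lower CM (e.g. foo) */
};

static struct bar_private bar_state;

/* Dynamic stacking: the lower proto name arrives as a parameter
 * instead of being compiled in. */
int bar_attach_on(const char *lower_proto, const char *cluster_name)
{
	return cm_attach(lower_proto, cluster_name, &bar_state.lower);
}
```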
