[Linux-cluster] Cluster snapshot server failover details

Daniel Phillips phillips at redhat.com
Wed Nov 24 13:51:14 UTC 2004


Hi all,

Ben has been working on the cluster infrastructure interface for cluster 
snapshot server failover for the last little while, and in that time, 
the two of us have covered a fair amount of conceptual ground, figuring 
out how to handle this reasonably elegantly and racelessly.  It turned 
out to be a _lot_ harder than I first expected.

It's just not very easy to handle all the races that come up.  However, 
I think we've got something worked out that is fairly solid, and should 
work as a model for server failover in general, not just for cluster 
snapshot.  We're also looking to prove that the cluster infrastructure 
interface we intend to propose to the community in general does in 
fact provide the right hooks to tackle such failover problems, 
including internode messaging that works reliably, even when the 
membership of the cluster is changing while the failover protocols are 
in progress.

More than once I've wondered if we need a stronger formalism, such as 
virtual synchrony, to do the job properly, or if such a formalism would 
simplify the implementation.  At this point, I have a tentative "no" 
for both questions.  There appears to be no message ordering going on 
here that we can't take care of accurately with the simple cluster 
messaging facility available.  The winning strategy just seems to be to 
enumerate all the possible events, make sure each is handled, and 
arrange things so that ordering does not matter where it cannot 
conveniently be controlled.

Ben, the algorithm down below is based loosely on your code, but covers 
some details that came up while I was putting pen to paper.  No doubt 
I've made a booboo or two somewhere, so please shout.

On to the problem.  We are dealing with failover of a cluster snapshot 
server, where some clients have to upload state (read locks) to the new 
server.  On server failover, we have to enforce the following rule:

1) Before the new server can begin to service requests, all snapshot
   clients that were connected to the previous server must have either
   reconnected (and uploaded their read locks) or left the cluster

and for client connection failover:

2) If a snapshot client disconnects, the server can't throw away its
   locks until it knows the client has really terminated (it might just
   have suffered a temporary disconnection)

Though these rules look simple, they blow up into a fairly complex 
interaction between clients, the server, agents and the cluster 
manager.  Rule (1) is enforced by having the new server's agent carry 
out a roll call of all nodes, to see which of them had snapshot clients 
that may have been connected to the failed server.  A complicating 
factor is that, in the middle of carrying out the roll call, we have to 
handle client joins and parts, and node joins and parts.

(What is an agent?  Each node has a csnap agent running on it that takes 
care of connecting any number of device mapper csnap clients on the 
node to a csnap server somewhere on the cluster, controls activating a 
csnap server on the node if need be, and controls server failover.)

Most of the complexity that arises lands on the agent rather than the 
server, which is nice because the server is already sufficiently 
complex.  The server's job is just to accept connections that agents 
make to it on behalf of their clients, and to notify its own agent of 
those connections so the agent can complete the roll call.  The 
server's agent will also tell the server about any clients that have 
parted, as opposed to just being temporarily disconnected, in order to 
implement rule (2).

One subtle point: in normal operation (that is, after failover has 
completed) when a client connects to its local agent and requests a 
server, the agent will forward a client join message to the server's 
agent.  The server's agent responds to the join message, and only then 
does the remote agent establish the client's connection to the server.  
It has to be done this way because the remote agent can't allow a 
client to connect until it has received confirmation from the server 
agent that the client join message was received.  Otherwise, if the 
remote agent's node parts, there will be no way to know that the 
client has parted.


Messages

There are a fair number of messages that fly around, collated here to 
help put the event descriptions in context:

The server agent will send:
  - roll call request to remote agents
  - roll call acknowledgement to remote agent
  - client join acknowledgement to remote agent
  - client part message to server
  - activation message to server

The server agent will receive:
  - initial node list from cluster manager
  - node part message from cluster manager
  - snapshot client connection message from server
  - roll call response from remote agent
  - client join message from remote agent
  - client part message from remote agent
  - connection from local client
  - disconnection from local client

A remote agent will receive:
  - roll call request from server agent
  - roll call acknowledgement from server agent
  - connection from local client
  - disconnection from local client
  - client join acknowledgement from server agent

A remote agent will send:
  - roll call response to server agent
  - client join message to server agent
  - client part message to server agent

A server will receive:
  - client connection
  - activation message from server agent
  - client part from server agent

A server will send:
  - client connection message to server agent

Note: the server does not tell the agent about disconnections; instead, 
the remote agent tells the server agent about client parts.

Note: each agent keeps a list of local client connections.  It only 
actually has to forward clients that were connected to the failed 
server, but it's less bookkeeping and does no harm just to forward all 
of them.
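
For concreteness, the whole lot might boil down to a single message 
type enumeration along these lines (all names invented for the sketch, 
not the actual csnap wire protocol):

enum csnap_message {
        /* server agent -> remote agents */
        ROLL_CALL_REQUEST,
        ROLL_CALL_ACK,
        CLIENT_JOIN_ACK,
        /* server agent -> server */
        ACTIVATE_SERVER,
        /* remote agent -> server agent (part also goes agent -> server) */
        ROLL_CALL_RESPONSE,
        CLIENT_JOIN,
        CLIENT_PART,
        /* server -> server agent */
        CLIENT_CONNECTION,
        /* cluster manager -> server agent */
        NODE_LIST,
        NODE_PART,
};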


Agent Data

- list of local snapshot client connections

- roll call list of nodes
   - used only for server failover

- client list of node:client pairs
   - used for server failover
   - used for snapshot client permanent disconnection
   - each client has state "failover" or "normal" 

- a failover count
   - how many snapshot client connections still expected
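
Spelled out as C data structures, this could look roughly as below.  
The field names and the helpers declared at the bottom are invented 
for the sketch, not lifted from Ben's code; the later fragments build 
on them.

struct client {
        int node;                        /* node the snapshot client runs on */
        int id;                          /* client identifier on that node */
        enum { FAILOVER, NORMAL } state;
        struct client *next;
};

struct agent {
        struct client *local_clients;    /* local snapshot client connections */
        int *roll_call;                  /* nodes still expected to answer */
        int roll_call_count;
        struct client *clients;          /* node:client pairs for failover and
                                            permanent disconnection */
        int failover_count;              /* connections still expected before
                                            the new server can be activated */
};

/* Invented helpers assumed by the sketches below: */
int  get_node_list(int **nodes);         /* returns node count */
void send_message(int node, enum csnap_message type, int client_id);
void send_to_server(enum csnap_message type, int node, int id);
int  remove_from_roll_call(struct agent *agent, int node);
void add_client(struct agent *agent, int node, int id, int state);
struct client *find_client(struct agent *agent, int node, int id);
void remove_client(struct agent *agent, struct client *client);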


Event handling


Server Agent

Acquires the server lock (failover):
  - get the current list of nodes from the cluster manager
     - Receive it any time before activating the new server
     - Some recently joined nodes may never have been connected to
       the old server and it doesn't matter
     - Don't add nodes that join after this point to the roll call list,
       they can't possibly have been connected to the old server
  - set failover count to zero
  - send a roll call request to the agent on every node in the list
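
In terms of that sketch, the kickoff might go something like this, 
with get_node_list standing in for whatever the cluster manager 
interface really gives us:

void start_failover(struct agent *agent)
{
        /* Snapshot the membership now: nodes that join later cannot
           have been connected to the old server, so they are ignored */
        agent->roll_call_count = get_node_list(&agent->roll_call);
        agent->failover_count = 0;
        for (int i = 0; i < agent->roll_call_count; i++)
                send_message(agent->roll_call[i], ROLL_CALL_REQUEST, -1);
}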

Receives a roll call response:
  - remove the responding node from the roll call list
     - if the node wasn't on the roll call list, what?
  - each response has a list of snapshot clients, add each to
    the client list
     - set state to failover
     - increase the failover count
  - send roll call response acknowledgement to remote agent
  - If roll call list empty and failover count zero, activate server
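
Roughly, still with the invented helpers; the "roll call list empty 
and failover count zero" test recurs in the events below, so it is 
worth factoring out:

/* Activate the server only when every node has answered the roll call
   or parted, and every reported client has reconnected or parted. */
static void maybe_activate_server(struct agent *agent)
{
        if (!agent->roll_call_count && !agent->failover_count)
                send_to_server(ACTIVATE_SERVER, -1, -1);
}

void roll_call_response(struct agent *agent, int node,
                        int *ids, int count)
{
        if (!remove_from_roll_call(agent, node))
                return;         /* not on the roll call list: "what?" */
        for (int i = 0; i < count; i++) {
                add_client(agent, node, ids[i], FAILOVER);
                agent->failover_count++;
        }
        send_message(node, ROLL_CALL_ACK, -1);
        maybe_activate_server(agent);
}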

Receives a node part event from cluster manager:
  - remove the node from the roll call list, if present
  - remove each matching client from the client list
     - if client state was normal send client part to server
     - if client state was failover, decrease the failover count
  - If roll call list empty and failover count zero, activate server

Receives a snapshot client connection message from server:
  - if the client wasn't on the client list, what?
  - if the client state was failover
     - decrease the failover count
     - set client state to normal
  - If roll call list empty and failover count zero, activate server
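
Or, continuing the sketch:

void client_connected(struct agent *agent, int node, int id)
{
        struct client *client = find_client(agent, node, id);
        if (!client)
                return;         /* not on the client list: "what?" */
        if (client->state == FAILOVER) {
                agent->failover_count--;
                client->state = NORMAL;
        }
        maybe_activate_server(agent);
}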

Receives a client join from a remote agent:
  - add the client to the client list in state normal
  - send client join acknowledgement to remote agent

Receives a client part from a remote agent:
  - if client wasn't on the client list, what?
  - remove the client from the client list
     - if client state was normal send client part to server
     - if client state was failover, decrease the failover count
  - If roll call list empty and failover count zero, activate server
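
In sketch form again; the node part event above is the same thing 
applied to every client of the parted node:

void client_parted(struct agent *agent, int node, int id)
{
        struct client *client = find_client(agent, node, id);
        if (!client)
                return;         /* not on the client list: "what?" */
        if (client->state == NORMAL)
                /* rule (2): the client is really gone, not just cut off */
                send_to_server(CLIENT_PART, node, id);
        else
                agent->failover_count--;
        remove_client(agent, client);
        maybe_activate_server(agent);
}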

Receives a snapshot client connection from device:
  - Send client join message to self

Receives a snapshot client disconnection from device:
  - Send client part message to self


Remote Agent

Receives a snapshot client connection from device:
  - If roll call response has already been sent, send client join
    message to server agent
  - otherwise just add client to local snapshot client list

Receives a snapshot client disconnection from device:
  - If roll call response has already been sent, send client part
    message to server agent
  - otherwise just remove client from local snapshot client list

Receives a roll call request:
  - Send entire list of local snapshot clients to server agent

Receives acknowledgement of roll call response from server agent:
  - Connect entire list of local snapshot clients to server

Receives client join acknowledgement from server agent:
  - Connect client to server
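
The remote agent's side is mostly about not connecting clients too 
early.  A sketch, once more with invented names:

struct remote_agent {
        struct client *local_clients;   /* local snapshot client connections */
        int responded;                  /* roll call response already sent? */
};

/* more invented helpers: */
void add_local_client(struct remote_agent *agent, int id);
void send_client_list(int node, struct client *list);
void connect_to_server(struct client *client);
struct client *find_local_client(struct remote_agent *agent, int id);
extern int server_agent_node;           /* node holding the server lock */

void local_client_connected(struct remote_agent *agent, int id)
{
        add_local_client(agent, id);
        if (agent->responded)
                send_message(server_agent_node, CLIENT_JOIN, id);
        /* otherwise the client rides along in the roll call response */
}

void roll_call_request(struct remote_agent *agent)
{
        /* report every local client; only those connected to the failed
           server strictly need reporting, but reporting all is harmless */
        send_client_list(server_agent_node, agent->local_clients);
        agent->responded = 1;
}

void roll_call_ack(struct remote_agent *agent)
{
        /* the server agent has now accounted for everything we
           reported, so it is safe to connect */
        for (struct client *c = agent->local_clients; c; c = c->next)
                connect_to_server(c);
}

void client_join_ack(struct remote_agent *agent, int id)
{
        /* connect only after the join is recorded, so a later node
           part cannot lose track of this client */
        connect_to_server(find_local_client(agent, id));
}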


Server

Receives a client connection
  - Send client connection message to server agent


Easy, huh?

Daniel



