[Linux-cluster] Csnap instantiation and failover using libdlm
Daniel Phillips
phillips at redhat.com
Tue Oct 19 05:42:56 UTC 2004
Good morning,
A prototype is now checked in that implements a simple
instantiation and failover system for csnap servers. The csnap server
was modified to connect to the agent and wait for an activation command
before loading anything from the snapshot store. All nodes will now be
running snapshot servers, but only one of them will be active at a
time. This is to avoid any filesystem accesses in the server failover
path, and to keep the number of syscalls used to a minimum so that they
can all be audited for bounded memory use.
Each node runs a single csnap agent that provides userspace services to
all snapshot or origin clients of one csnap device. On receiving a
connection from a server, the agent tries to make that server the
master for the cluster, or learn the address of an existing master,
using the following algorithm:
Repeat as necessary:
- Try to grab the lock in Protected Write (PW) mode without waiting
- If we get it, write our server address to the lock value block (lvb)
and start the server; done
- Otherwise, convert to Concurrent Read (CR) without waiting
- If there's a server address in the lvb, use it; done
The agent will also attempt to do this any time a client requests a
server connection, which it will do if its original server connection
breaks.
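The election loop above can be modeled with a toy lock manager. This is
a hypothetical sketch, not the real libdlm API: FakeLockspace, try_pw()
and read_lvb() are invented stand-ins for a no-wait Protected Write
request and a Concurrent Read of the lvb.

```python
# Toy model of the agent's election loop; all names are illustrative,
# not the real libdlm calls.

class FakeLockspace:
    """Single-resource lock manager with a lock value block (lvb)."""
    def __init__(self):
        self.pw_holder = None   # at most one node may hold PW
        self.lvb = None         # holds the master's server address

    def try_pw(self, node):
        """Try-lock in PW mode without waiting; True if granted."""
        if self.pw_holder in (None, node):
            self.pw_holder = node
            return True
        return False

    def read_lvb(self):
        """CR never blocks against PW here, so just return the lvb."""
        return self.lvb

def find_master(lockspace, node, my_address):
    """Repeat as necessary: become master or learn the master's address."""
    while True:
        if lockspace.try_pw(node):
            lockspace.lvb = my_address      # publish our server address
            return ("master", my_address)
        addr = lockspace.read_lvb()
        if addr is not None:
            # May be stale; a standby server will refuse the connection
            # and the agent will simply run this loop again.
            return ("client", addr)
        # The winner is between grabbing PW and writing the lvb: retry.
```

The first node to arrive becomes master; later arrivals read the
master's address out of the lvb.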
This algorithm looks simple, but it is racy:
- Other nodes may read the lvb before the new master writes it,
getting a stale address, particularly in the thundering rush to
acquire the lock immediately after a server failure.
- Other nodes may be using a stale address written by a previous
master
However, only one server can actually own the lock, and other servers
will refuse (or discard) connections from clients that have stale
addresses. So the race doesn't seem to hurt much, and this algorithm
will do for the time being.
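Why a stale address is mostly harmless can be illustrated in a few
lines. Everything here is invented for the sketch: a non-master server
refuses the client, and the client just asks its agent for a fresh
address, rerunning the election as needed.

```python
# Hypothetical client-side retry: servers is a dict of address ->
# {"active": bool}, ask_agent() stands in for asking the local csnap
# agent for a fresh server address.

def connect_to_server(address, servers, ask_agent):
    """Try the address we have; on refusal, get a new one from the agent."""
    while True:
        server = servers.get(address)
        if server is not None and server["active"]:
            return server       # connected to the live master
        address = ask_agent()   # stale: agent reruns the election
```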
Eventually this needs to be tightened up. I suspect that using the dlm
to distribute addresses is fundamentally the wrong approach and that a
verifiable algorithm must be based directly on membership events. The
dlm should really be doing only the job that it does well: enforcing
the exclusive lock on the snapshot store.
That said, I have an alternate dlm-based algorithm in mind that uses
blocking notifications:
Every node does:
- Grab exclusive (blocking asts are sent out)
- If we get it, write the lvb and demote to PW
Every node that gets a blocking ast does:
- Demote to null, unblocking the exclusive above
- Get CR, if the lvb has a server address we're done
- Otherwise try to grab the exclusive again
This not only closes the race between writing and reading the lvb, it
sends notifications to all nodes that have stale server addresses (held
in CR mode). So it's a little better, but it is still possible to get
stale addresses if things die at just the wrong time.
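The blocking-ast variant can also be sketched as a toy model. This is a
loose simulation under invented names (Resource, elect, the round-robin
scheduler): the blocking ast is modeled as a CR-to-NL demote at the top
of each node's turn, and the EX winner writes the lvb and demotes to PW.

```python
# Toy model of the blocking-ast algorithm; not the real dlm semantics.

EX, PW, CR, NL = "EX", "PW", "CR", "NL"

class Resource:
    def __init__(self):
        self.modes = {}     # node -> granted mode
        self.lvb = None

    def compatible(self, node, mode):
        """Crude compatibility check: EX conflicts with everything,
        PW conflicts with EX and PW."""
        for other, held in self.modes.items():
            if other == node:
                continue
            if mode == EX or held == EX:
                return False
            if mode == PW and held == PW:
                return False
        return True

def elect(res, nodes, addresses):
    """Run the nodes round-robin until each is master or has an address."""
    result = {}
    pending = list(nodes)
    while pending:
        node = pending.pop(0)
        # Blocking ast delivered: a CR holder demotes to NL first
        if res.modes.get(node) == CR:
            res.modes[node] = NL
        if res.lvb is not None:
            res.modes[node] = CR          # get CR, read the lvb
            result[node] = ("client", res.lvb)
        elif res.compatible(node, EX):
            res.modes[node] = EX          # won the exclusive
            res.lvb = addresses[node]
            res.modes[node] = PW          # demote, keeping the lock
            result[node] = ("master", res.lvb)
        else:
            pending.append(node)          # blocked; retry after the ast
    return result
```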
Both algorithms are potentially unbounded recursions. The csnap agents
have no way of knowing whether somebody out there is just slow, or
whether somebody is erroneously sitting on the exclusive lock without
actually instantiating a server. So after some number of attempts, a
human operator has to be notified, and the agent will just keep trying.
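That retry policy might look like the following, where
attempt_election() and notify_operator() are placeholders for the real
agent's logic rather than anything in the current code:

```python
# Hypothetical retry policy: retry forever, but alert a human once
# after a fixed number of failed attempts.

def keep_trying(attempt_election, notify_operator, alert_after=10):
    """Retry until an election attempt succeeds; returns the result."""
    failures = 0
    while True:
        result = attempt_election()
        if result is not None:
            return result
        failures += 1
        if failures == alert_after:
            notify_operator("no csnap master after %d attempts" % failures)
```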
I don't like this much at all, and it's one reason why I want to get
away from lvb games, eventually. (I have a nagging feeling that
lvb-based algorithms are only ever thought to be reliable when they get
so complex that nobody understands them.)
Currently, this is only implemented for gdlm. Gulm does not have PW or
CR locks, but equivalent algorithms can be devised using more than one
lock. If gulm is supported, there will be separate gulm-csnap-agent
and gdlm-csnap-agent binaries, plus a plain csnap-agent for running
snapshots on a single node without any locking libraries installed.
The current prototype supports only IPv4; however, IPv6 support
requires changes only to the user space components.
An agent must be running before a csnap server can be started or a csnap
device can be created. Cman must be running before gdlm can be
started, and ccsd must be running before cman will start. So a test
run looks something like this:
ccsd
cman_tool join
csnap-agent @test-control
csnap-server /dev/test-origin /dev/test-snapstore 8080 @test-control
echo 0 497976 csnapshot 0 /dev/test-origin /dev/test-snapstore \
@test-control | /sbin/dmsetup create testdev
For what it's worth, the server and clients can be started in any order.
The three bits of the csnap device are bound together by the
@test-control named socket, which is fairly nice: it's hard to get this
wrong. It's a little annoying that the device names have to be stated
in two places, "you should never have to tell the computer something it
already knows". It's tempting to make them optional on the server
command line: the server can learn the device names from the device
mapper target, or they can be given on the command line to run
stand-alone. The device mapper device size (497976) in the line
above is also redundant: it is already encoded in the snapshot store
metadata. It would be better for the device target to be told the size
once it connects to a server, but that is not the way device mapper
works at present.
Instantiation and failover now seem to be under control, with the
caveats above. The last bit of cluster infrastructure work needed is
to teach the standby servers how to know when all the snapshot clients
of a defunct server have either reconnected or left the cluster. Until
recently I'd been planning to distribute the connection list to all the
standby servers, but that is stupid: the local cluster manager already
knows about the connections and the agent on every node is perfectly
capable of keeping track of them on behalf of its standby server.
Regards,
Daniel