[Linux-cluster] Csnap instantiation and failover using libdlm
Daniel Phillips
phillips at redhat.com
Tue Oct 19 05:42:56 UTC 2004
Good morning,
A prototype is now checked in that implements a simple
instantiation and failover system for csnap servers. The csnap server
was modified to connect to the agent and wait for an activation command
before loading anything from the snapshot store. All nodes will now be
running snapshot servers, but only one of them will be active at a
time. This is to avoid any filesystem accesses in the server failover
path, and to keep the number of syscalls used to a minimum so that they
can all be audited for bounded memory use.
Each node runs a single csnap agent that provides userspace services to
all snapshot or origin clients of one csnap device. On receiving a
connection from a server, the agent tries to make that server the
master for the cluster, or learn the address of an existing master,
using the following algorithm:
Repeat as necessary:
- Try to grab the lock in Protected Write (PW) mode without waiting
- If we get it, write our server address to the lock value block (lvb)
and start the server; done
- Otherwise, convert to Concurrent Read (CR) without waiting
- If there's a server address in the lvb, use it; done
The agent will also attempt to do this any time a client requests a
server connection, which it will do if its original server connection
breaks.
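The election loop above can be modeled with a toy lock manager. This is
a hypothetical sketch, not the real libdlm API: FakeLockspace, try_pw()
and read_lvb() are invented stand-ins for a no-wait Protected Write
request and a Concurrent Read of the lvb.

```python
# Toy model of the agent's election loop; all names are illustrative,
# not the real libdlm calls.

class FakeLockspace:
    """Single-resource lock manager with a lock value block (lvb)."""
    def __init__(self):
        self.pw_holder = None   # at most one node may hold PW
        self.lvb = None         # holds the master's server address

    def try_pw(self, node):
        """Try-lock in PW mode without waiting; True if granted."""
        if self.pw_holder in (None, node):
            self.pw_holder = node
            return True
        return False

    def read_lvb(self):
        """CR never blocks against PW here, so just return the lvb."""
        return self.lvb

def find_master(lockspace, node, my_address):
    """Repeat as necessary: become master or learn the master's address."""
    while True:
        if lockspace.try_pw(node):
            lockspace.lvb = my_address      # publish our server address
            return ("master", my_address)
        addr = lockspace.read_lvb()
        if addr is not None:
            # May be stale; a standby server will refuse the connection
            # and the agent will simply run this loop again.
            return ("client", addr)
        # The winner is between grabbing PW and writing the lvb: retry.
```

The first node to arrive becomes master; later arrivals read the
master's address out of the lvb.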
This algorithm looks simple, but it is racy:
- Other nodes may read the lvb before the new master writes it,
getting a stale address, particularly in the thundering rush to
acquire the lock immediately after a server failure.
- Other nodes may be using a stale address written by a previous
master
However, only one server can actually own the lock, and other servers
will refuse (or discard) connections from clients that have stale
addresses. So the race doesn't seem to hurt much, and this algorithm
will do for the time being.
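Why a stale address is mostly harmless can be illustrated in a few
lines. Everything here is invented for the sketch: a non-master server
refuses the client, and the client just asks its agent for a fresh
address, rerunning the election as needed.

```python
# Hypothetical client-side retry: servers is a dict of address ->
# {"active": bool}, ask_agent() stands in for asking the local csnap
# agent for a fresh server address.

def connect_to_server(address, servers, ask_agent):
    """Try the address we have; on refusal, get a new one from the agent."""
    while True:
        server = servers.get(address)
        if server is not None and server["active"]:
            return server       # connected to the live master
        address = ask_agent()   # stale: agent reruns the election
```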
Eventually this needs to be tightened up. I suspect that using the dlm
to distribute addresses is fundamentally the wrong approach and that a
verifiable algorithm must be based directly on membership events. The
dlm should really be doing only the job that it does well: enforcing
the exclusive lock on the snapshot store.
That said, I have an alternate dlm-based algorithm in mind that uses
blocking notifications:
Every node does:
- Grab exclusive (blocking asts are sent out)
- If we get it, write the lvb and demote to PW
Every node that gets a blocking ast does:
- Demote to null, unblocking the exclusive above
- Get CR, if the lvb has a server address we're done
- Otherwise try to grab the exclusive again
This not only closes the race between writing and reading the lvb, it
sends notifications to all nodes that have stale server addresses (held
in CR mode). So it's a little better, but it is still possible to get
stale addresses if things die at just the wrong time.
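The blocking-ast variant can also be sketched as a toy model. This is a
loose simulation under invented names (Resource, elect, the round-robin
scheduler): the blocking ast is modeled as a CR-to-NL demote at the top
of each node's turn, and the EX winner writes the lvb and demotes to PW.

```python
# Toy model of the blocking-ast algorithm; not the real dlm semantics.

EX, PW, CR, NL = "EX", "PW", "CR", "NL"

class Resource:
    def __init__(self):
        self.modes = {}     # node -> granted mode
        self.lvb = None

    def compatible(self, node, mode):
        """Crude compatibility check: EX conflicts with everything,
        PW conflicts with EX and PW."""
        for other, held in self.modes.items():
            if other == node:
                continue
            if mode == EX or held == EX:
                return False
            if mode == PW and held == PW:
                return False
        return True

def elect(res, nodes, addresses):
    """Run the nodes round-robin until each is master or has an address."""
    result = {}
    pending = list(nodes)
    while pending:
        node = pending.pop(0)
        # Blocking ast delivered: a CR holder demotes to NL first
        if res.modes.get(node) == CR:
            res.modes[node] = NL
        if res.lvb is not None:
            res.modes[node] = CR          # get CR, read the lvb
            result[node] = ("client", res.lvb)
        elif res.compatible(node, EX):
            res.modes[node] = EX          # won the exclusive
            res.lvb = addresses[node]
            res.modes[node] = PW          # demote, keeping the lock
            result[node] = ("master", res.lvb)
        else:
            pending.append(node)          # blocked; retry after the ast
    return result
```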
Both algorithms are potentially unbounded recursions. The csnap agents
have no way of knowing whether somebody out there is just slow, or
whether somebody is erroneously sitting on the exclusive lock without
actually instantiating a server. So after some number of attempts, a
human operator has to be notified, and the agent will just keep trying.
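That retry policy might look like the following, where
attempt_election() and notify_operator() are placeholders for the real
agent's logic rather than anything in the current code:

```python
# Hypothetical retry policy: retry forever, but alert a human once
# after a fixed number of failed attempts.

def keep_trying(attempt_election, notify_operator, alert_after=10):
    """Retry until an election attempt succeeds; returns the result."""
    failures = 0
    while True:
        result = attempt_election()
        if result is not None:
            return result
        failures += 1
        if failures == alert_after:
            notify_operator("no csnap master after %d attempts" % failures)
```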
I don't like this much at all, and it's one reason why I want to get
away from lvb games, eventually. (I have a nagging feeling that
lvb-based algorithms are only ever thought to be reliable when they get
so complex that nobody understands them.)
Currently, this is only implemented for gdlm. Gulm does not have PW or
CR locks, but equivalent algorithms can be devised using more than one
lock. If gulm is supported, there will be separate gulm-csnap-agent
and gdlm-csnap-agent binaries, plus a plain csnap-agent for running
snapshots on a single node without any locking libraries installed.
The current prototype supports only IPv4; however, IPv6 support
requires changes only to the user space components.
An agent must be running before a csnap server can be started or a csnap
device can be created. Cman must be running before gdlm can be
started, and ccsd must be running before cman will start. So a test
run looks something like this:
ccsd
cman_tool join
csnap-agent @test-control
csnap-server /dev/test-origin /dev/test-snapstore 8080 @test-control
echo 0 497976 csnapshot 0 /dev/test-origin /dev/test-snapstore \
@test-control | /sbin/dmsetup create testdev
For what it's worth, the server and clients can be started in any order.
The three bits of the csnap device are bound together by the
@test-control named socket, which is fairly nice: it's hard to get this
wrong. It's a little annoying that the device names have to be stated
in two places, "you should never have to tell the computer something it
already knows". It's tempting to make them optional on the server
command line: the server can learn the device names from the device
mapper target, or they can be given on the command line to run
stand-alone. The device mapper device size (497976) in the line
above is also redundant: it is already encoded in the snapshot store
metadata. It would be better for the device target to be told the size
once it connects to a server, but that is not the way device mapper
works at present.
Instantiation and failover now seem to be under control, with the
caveats above. The last bit of cluster infrastructure work needed is
to teach the standby servers how to know when all the snapshot clients
of a defunct server have either reconnected or left the cluster. Until
recently I'd been planning to distribute the connection list to all the
standby servers, but that is stupid: the local cluster manager already
knows about the connections and the agent on every node is perfectly
capable of keeping track of them on behalf of its standby server.
Regards,
Daniel