[Linux-cluster] Re: Csnap instantiation and failover using libdlm
Daniel Phillips
phillips at redhat.com
Tue Oct 19 17:48:15 UTC 2004
On Tuesday 19 October 2004 11:20, Benjamin Marzinski wrote:
> Daniel, I think this is a perfectly reasonable method of failing the
server over (I'd like the clients not to be dependent on a userspace
> process for reconnection, but that's another issue). Only, unless I
> am misunderstanding something, it seems to go directly against one of
> your earlier requirements.
>
> The issue is failure detection. Previously, you indicated that you
> were in favor of failure detection by the client or at the very
least, some outside agent. As far as I understand the method you are
implementing, as long as the server doesn't give up the lock, it will
be treated as healthy.
As always, the client detects a broken server connection and asks for a
new connection.
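A minimal sketch of that client-side behavior (the function names, the
OSError-on-broken-connection convention, and the retry policy are my
assumptions for illustration, not csnap code):

```python
import time

def with_reconnect(do_request, reconnect, retries=3):
    """Issue one request; on a broken connection, report it and ask the
    agent for a fresh connection, then retry. The client never decides
    who the new server is -- it only detects the failure and reconnects.
    (Sketch only: names and policy are assumptions, not csnap code.)"""
    for attempt in range(retries):
        try:
            return do_request()
        except OSError:
            reconnect()                  # agent hands back a new connection
            time.sleep(0.01 * attempt)   # small backoff between attempts
    raise RuntimeError("no usable server connection")
```

The point of the sketch is only the division of labor: detection lives in
the client, while choosing a replacement server is someone else's job.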
> Is there some method for the lock to be revoked,
Killing the agent that has it should do the job, which would be part of
stomith. There also has to be a way of giving up the lock gracefully
when a node exits the cluster voluntarily. I neglected to mention
"graceful node exit and cleanup" as another bit of infrastructure glue
still needed.
> or some sort of heart-beating, or have you just relaxed that
> requirement.
I speculated that eventually the kernel client might heartbeat its
connection; I think I used the term "ping". That still seems like a
good idea: then the client can detect connection failure when it
occurs, not just when it attempts to service a request over the
connection. Nothing else changes.
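A sketch of that kind of connection-level ping (the one-byte echo
protocol, the timeout, and the socketpair stand-in for the client/server
connection are all assumptions for illustration; the real csnap wire
protocol is different):

```python
import socket
import threading

PING = b"\x00"  # hypothetical one-byte ping, not the csnap protocol

def ping(sock, timeout=1.0):
    """Return True if the peer echoes a ping within `timeout` seconds,
    so a dead connection is noticed at ping time, not request time."""
    try:
        sock.settimeout(timeout)
        sock.sendall(PING)
        return sock.recv(1) == PING
    except OSError:
        return False

# Demo: a socketpair stands in for the client/server connection.
client, server = socket.socketpair()
threading.Thread(target=lambda: server.sendall(server.recv(1)),
                 daemon=True).start()
assert ping(client)        # healthy: the peer echoes
server.close()
assert not ping(client)    # broken: detected without waiting for a request
```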
There is also nothing preventing heartbeating at the node level as well.
The csnap bits do not have to participate in it. If the node-level
heartbeat fails the node will be ejected and will receive a membership
event, which the csnap agent will pick up when that part is
implemented.
> O.k. stupid question time: If a userspace process grabs this
> exclusive lock, and then dies unexpectedly, does the lock
> automatically get freed?
Yes. Though I haven't looked closely at this, it seems the locks are
cleaned up when the fd that libdlm creates to pass lock completions to
userspace is closed. It's not strictly coupled to process exit, which
is arguably even more sensible.
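The same fd-tied cleanup can be demonstrated with flock(2), which I'm
using here purely as an analogue for a DLM lock held through libdlm's
completion fd (this is not libdlm code; the kernel drops the lock when
the holder's fd goes away, including implicitly at process exit):

```python
import fcntl
import os
import tempfile

lockpath = tempfile.NamedTemporaryFile(delete=False).name

pid = os.fork()
if pid == 0:
    # Child: take the lock, then die without ever unlocking.
    child_fd = os.open(lockpath, os.O_RDWR)
    fcntl.flock(child_fd, fcntl.LOCK_EX)
    os._exit(0)   # kernel releases the lock as the fd closes on exit

os.waitpid(pid, 0)
fd = os.open(lockpath, os.O_RDWR)
# Succeeds immediately: the dead child's lock was cleaned up for us.
fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
```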
> If not, who is freeing the lock? I'm
> probably missing something here, but I don't quite understand how
> server failure detection will work.
The lock is really there primarily to enforce exclusive ownership of the
snapshot store device. If the client says the connection is bad, the
agent will believe the client and initiate recovery using the algorithm
above, which is more or less functional but is never going to be
entirely satisfactory until it incorporates membership events
explicitly.
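The ownership property itself can be sketched with a non-blocking
exclusive lock; again flock(2) stands in for an exclusive lock requested
through libdlm, and every name here is illustrative rather than csnap's:

```python
import fcntl
import os
import tempfile

def try_become_master(path):
    """Try to take the exclusive ownership lock on the snapshot store
    (flock here is an analogue for a no-queue EX request via libdlm).
    Returns an fd on success, None if another agent already owns it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        os.close(fd)        # someone else is master; stand by
        return None

store = tempfile.NamedTemporaryFile(delete=False).name
master = try_become_master(store)        # first agent wins
standby = try_become_master(store)       # second agent must stand by
assert master is not None and standby is None
os.close(master)                         # master releases (or dies)
assert try_become_master(store) is not None   # standby takes over
```

The lock never decides whether the server is healthy; it only guarantees
that two servers cannot both own the snapshot store while recovery runs.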
Regards,
Daniel