[Linux-cluster] Re: Csnap instantiation and failover using libdlm

Daniel Phillips phillips at redhat.com
Fri Oct 22 18:39:46 UTC 2004


On Friday 22 October 2004 13:31, Benjamin Marzinski wrote:
> On Thu, Oct 21, 2004 at 08:27:46PM -0400, Daniel Phillips wrote:
> > On Thursday 21 October 2004 17:56, Benjamin Marzinski wrote:
> > > If the agent dies but the server doesn't, the lock will get revoked.
> > > While this won't interfere with the clients currently connected to
> > > the server, any new client (or client that gets disconnected) will
> > > think that there is no server, and promote its server to master....
> > > and data corruption will follow.
> > >
> > > As far as I can tell, the way to ensure that this doesn't happen is
> > > to have the server process take out the lock. That way the lock won't
> > > be freed unless the server process dies. Agreed?
> >
> > No, the way to ensure this is to have the server die if its control
> > socket goes away.
>
> O.k., let's say that you come up with a way to have a signal sent to the
> server when the agent dies.  (I can't think of any easy way to do this,
> but that would be the surest way of killing the server immediately after
> the agent dies).

Poll event.
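
By "poll event" I mean the server holds one end of a socket to the agent
and watches it; when the agent exits, the kernel closes the agent's end
and the server sees POLLHUP or a zero-length read.  A minimal sketch in
plain POSIX (not csnap's actual code):

    #include <errno.h>
    #include <poll.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Server-side watchdog on the control socket to the agent.  When
     * the agent dies, its end of the socket closes and the server sees
     * POLLHUP or EOF, and exits before it can take any more requests. */
    void watch_agent(int control_sock)
    {
            struct pollfd pfd = { .fd = control_sock, .events = POLLIN };
            char buf[256];

            for (;;) {
                    if (poll(&pfd, 1, -1) < 0) {
                            if (errno == EINTR)
                                    continue;
                            exit(1);
                    }
                    if (pfd.revents & (POLLHUP | POLLERR))
                            exit(1);        /* agent gone, stop serving */
                    if (read(control_sock, buf, sizeof buf) <= 0)
                            exit(1);        /* EOF: agent closed its end */
                    /* otherwise handle the control message in buf */
            }
    }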

> Even in that case, there is still the possibility of 
> corruption.  Say you have nodes A and B, both with csnap_servers. A is
> running the master server. Let's just say, to make this example more
> reasonable, that A and B are accessing the origin and snapstore over
> iSCSI. A request comes into the server on A from node B that causes it to
> write to the snapstore. Due to some network issue, this takes a while. 
> At the same time the agent running on A dies. Somehow, a signal
> automatically gets sent to the server on A.  However, since this server
> is in the D state, the signal does not get delivered yet. This network
> issue also causes node B to lose connection to node A.  It sees that
> there is no server, starts its own and sends the request to its server.
> The servers finally write to the snapstore at the same time, and the disk
> is corrupted.

As far as snapshot store integrity goes, there is no difference between the 
server continuing to coast along a little and in-flight writes continuing to 
flow through the IO stack.  I stated earlier that the new server would have 
to take special measures if the dead server's node is still in the cluster; 
perhaps you missed it.

Special measures means fencing, hopefully lightweight fencing such as 
banning the old server from the network block device.
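
In other words, before letting its server recover the journal, the
takeover node's agent would do roughly the following.  Note that
still_in_cluster() and fence_node() here are made-up placeholders for
whatever the membership and fencing layers actually provide, not real
cman or csnap calls:

    /* Hypothetical takeover path; every call here is a placeholder.
     * If the old server's node is still a cluster member, its in-flight
     * writes may land at any time, so ban it from the shared device (or
     * power-fence it) before replaying the journal. */
    void take_over(int old_server_node)
    {
            if (still_in_cluster(old_server_node))
                    fence_node(old_server_node);
            recover_journal();
            start_serving();
    }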

> O.k., it's a corner case, but there are probably more likely cases too.
> There is no way that I know of to notify the server in a way that would
> guarantee that there was no corruption.  The risk isn't high, but it is
> there.

It's the same as any other failover.

> > However, you have pointed out why it's bad for the new server to rely
> > only on the lock to decide when its safe to start processing requests,
> > or even to recover the journal: there may still be writes in flight
> > from the old server.  If a server dies but its node is still in the
> > cluster, the new server's agent has to regard that as a valid reason
> > for fencing the node.  This can only be handled properly at the
> > membership level, not at the lock level.
>
> Yes, fencing would fix this, but weren't you pushing for the least
> drastic solution to the problem?

This is the least drastic solution.

Having the server try to hang onto the lock is a non-solution: there could 
still be writes in flight even after the server exits.  Nothing is solved 
by having the server hang on to the lock after it's already marked for 
death.  If anything, it slows down the recovery process.
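
For what it's worth, the grab itself is trivial wherever it lives: it is
just the ordinary synchronous libdlm call, roughly like the sketch below.
(I'm writing the call from memory, so check libdlm.h for the exact
signature.)

    #include <string.h>
    #include <libdlm.h>

    static struct dlm_lksb lksb;

    /* Try for the master lock.  LKF_NOQUEUE means fail immediately
     * instead of queueing, so on failure the caller knows a master
     * already exists and should connect as a client instead.
     * (Call written from memory -- verify against libdlm.h.) */
    int try_master_lock(void)
    {
            int status = dlm_lock_wait(LKM_EXMODE, &lksb, LKF_NOQUEUE,
                                       "csnap-master",
                                       strlen("csnap-master"),
                                       0, NULL, NULL, NULL);
            return status == 0 && lksb.sb_status == 0;
    }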

Anyway, if the agent is dead then the new server won't start: every agent 
has to answer first.

> Since all the server processes only 
> write directly to the disk, if the server process is dead, then no writes
> will reach the disk. That's the reason why you can't kill processes in
> the 'D' state. As long as you have journaling to clean up the unfinished
> transactions it should be perfectly safe to failover once the server is
> dead.
>
> If the agent is dead, but the server isn't, then you have problems.
> Really, since the new server can't start until all the clients break
> their connection with the old server, the only issue you have to worry
> about is that the old server might be stuck waiting for a write to
> complete, as in the example above. Since that wouldn't cause the agent to
> die, fencing because the agent died is almost always unnecessary.
>
> Besides that, you add a bunch of complexity to the agent.  Say that SM
> tells the agent that a node is no longer in the service group, but it
> still is in the cluster. The agent has to decide if the node cleanly left
> the group or not. I suppose that it could check to see if it has any
> clients connected to the server, and if so, it knows that the server has
> not left cleanly. But what if the agent on the new server node doesn't
> have any clients connected to the old server? It doesn't know whether or
> not other nodes do. It would have to communicate with every agent in the
> service group to see if the server should be fenced.
>
> If the server grabbed the lock instead, you would only fail over when the
> server was dead. As I said before, once the server is dead, it is
> completely safe to fail over.

No, there could still be writes in flight.  Once the old server has handed 
a write to the block layer or the network, killing the process does not 
recall it.

> > > If that's the case, should the server also be responsible for
> > > contacting the agents in the appropriate service group and getting
> > > the client information?
> >
> > It's not the case, so we don't have to worry about it.
> >
> > The only interesting argument I know of for moving infrastructure
> > details into the server is to get rid of one daemon,
>
> And to eliminate a corner case that causes corruption without having
> agents fencing nodes for usually no good reason. And to keep from adding
> a bunch of additional code that wouldn't be necessary if the code was
> moved.
>
> > but daemons are
> > cheap, particularly if they sleep nearly all the time like the agent
> > does.  It's better to keep the agent and daemon separate and
> > specialized for the time being.
>
> I don't think we have to pull all the infrastructure into the server. I
> think it seems logical, but if you are against it, I don't really care.
> But I do believe that having the agent grab and hold the lock, instead of
> having the server do it, is a bad idea.
