[Cluster-devel] cluster/group/daemon cman.c cpg.c gd_internal. ...

Tue Jun 20 20:06:02 UTC 2006

On Tue, Jun 20, 2006 at 02:43:50PM -0500, Robert Peterson wrote:
> David Teigland wrote:
> >Might be a good idea, I don't really know.  I'm not even sure we'd need to
> >save much or any additional state that couldn't be pulled from the gfs/dlm
> >instances themselves.  It seems to me the challenge would be writing the
> >daemons so they could put all the pieces and interconnections back
> >together again.
> >
> >If this ends up being a big enough problem to get more attention, I think
> >the first practical improvement we could make is something like
> >blocking/clearing i/o from the residual fs's (like we do in withdraw) and
> >adding the ability to fully purge instances of gfs/dlm from the kernel
> >without rebooting the node.  Then the machines could all start from
> >scratch without rebooting or fencing.
> Here's another idea that came to me:
> 
> For critical cluster processes like cman and fenced, maybe we could use
> init's ability to restart processes, i.e. the "respawn" option in
> /etc/inittab.  Maybe we can use "respawn" or something similar to ensure
> that if a critical process like fenced dies, it gets restarted
> automatically and immediately.  Of course, that might cause problems for
> shutdown, etc., and it would probably make it harder to test certain
> things...

If the daemon is managing something when it dies, the failure amounts to a
full node failure, and the node needs to be recovered by the other nodes.
Respawning the daemon in that case would probably just get in the way of
the other nodes doing recovery.  Respawning would make sense only if a
daemon failed while it wasn't managing anything, but that's pretty unlikely.
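
For reference, the "respawn" idea would look roughly like the inittab
entry below.  This is only a sketch: the id "fd", the runlevels, the
/sbin/fenced path and the -D (stay in the foreground) flag are all
assumptions, not something taken from our init scripts.

```
# /etc/inittab entry format is  id:runlevels:action:process
# Hypothetical entry: id, runlevels, path and flag are assumptions.
# init can only respawn a process that stays in the foreground; a
# daemon that forks and exits looks to init like it died at once.
fd:345:respawn:/sbin/fenced -D
```

Note that an entry like this restarts the daemon unconditionally.  init
can't tell whether the daemon was managing anything when it died, which
is the case where a restart gets in the way of recovery.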

Dave