Process Watchdogs

Mon Jan 24 05:43:31 UTC 2005

Short answer: Take a look at Dan Bernstein's daemontools
(http://cr.yp.to/daemontools.html).

Long answer: Here are a few approaches to keeping a process running.

- Have a parent process spawn the watched process (WP).  If the WP
  dies, the parent knows immediately, either because it's blocked in
  a wait() for the WP or because it has arranged to handle the
  SIGCHLD caused by the WP's exit.  This may be the easiest way to
  do it, since you don't have to do anything to the WP itself.
  Running the WP out of inittab is basically taking this approach.

- Use some kind of pipe with one end open in the WP, the other in
  the monitor.  If the WP exits, its end is closed and the monitor
  sees EOF on the pipe.  Some systems favor this strategy, since
  you can check for lots of WP exits when convenient; e.g., with a
  select() on the pipes of N WPs.  Can't think of a good example
  off the top of my head, though...

- Provide the WP with some kind of status harness, like a
  control/status socket, or some specific response to a UNIX
  signal (e.g., dump current program status with SIGUSR1,
  restart if the status request doesn't get a response in
  reasonable time).  It looks to me as if the monit tools
  provide a framework for this (http://www.tildeslash.com/monit/).

- Periodically poll the OS to see if the WP is still alive.  This
  can be done by sending signal 0 (i.e., kill(NNNNN, 0)) or, if you
  want, through other interfaces like checking for
  /proc/NNNNN/status (Linux specific) or grepping through ps output
  (or using pgrep).  However, you should do some checking to be
  sure that it's really the process you think it is, not just some
  unrelated process that was handed pid NNNNN after your WP
  silently died.
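
To make the pipe approach a bit more concrete, here is a rough
Python sketch, assuming a POSIX system.  The names spawn_with_pipe()
and exited() are made up for illustration; they don't come from any
library.  The WP just inherits the write end and never touches it;
the monitor sees EOF when the last holder of that end goes away.

```python
import os
import select
import subprocess
import sys


def spawn_with_pipe(argv):
    """Start argv as a WP that holds the write end of a fresh pipe."""
    read_fd, write_fd = os.pipe()
    # pass_fds keeps write_fd open in the child; the WP need not use it.
    proc = subprocess.Popen(argv, pass_fds=(write_fd,))
    # The monitor must close its own copy, or it will never see EOF.
    os.close(write_fd)
    return proc, read_fd


def exited(read_fds, timeout):
    """Return the pipe fds whose WP has exited (EOF on the read end)."""
    ready, _, _ = select.select(read_fds, [], [], timeout)
    # A zero-length read means every holder of the write end is gone.
    return [fd for fd in ready if os.read(fd, 1) == b""]


if __name__ == "__main__":
    # Hypothetical WP: a child that just sleeps for a second.
    wp, fd = spawn_with_pipe(
        [sys.executable, "-c", "import time; time.sleep(1)"])
    while not exited([fd], timeout=5.0):
        pass
    wp.wait()  # reap the child so it does not linger as a zombie
    os.close(fd)
    print("WP exited with status", wp.returncode)
```

One caveat with this sketch: if the WP forks children of its own
that inherit the descriptor, EOF won't show up until they exit too.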

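For the polling approach, something along these lines might do,
again as a Python sketch for POSIX systems.  pid_exists(),
proc_name() and wp_alive() are hypothetical helper names, and the
identity check leans on the Linux-specific /proc/NNNNN/status file
mentioned above.  Comparing the command name is only a weak check
(the kernel truncates it, and another copy of the same program
would match), so treat it as a starting point.

```python
import os
import subprocess
import sys


def pid_exists(pid):
    """Probe pid with signal 0; nothing is actually delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:   # ESRCH: no such process
        return False
    except PermissionError:      # EPERM: exists, but owned by someone else
        return True
    return True


def proc_name(pid):
    """Linux only: the Name: field of /proc/<pid>/status, or None."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("Name:"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return None


def wp_alive(pid, expected_name):
    """True only if pid exists and still looks like our WP."""
    return pid_exists(pid) and proc_name(pid) == expected_name


if __name__ == "__main__":
    # Hypothetical WP: a short-lived child of this script.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    print("while running:", pid_exists(child.pid))
    child.wait()  # an unreaped (zombie) child would still count as existing
    print("after exit:  ", pid_exists(child.pid))
```

Note that signal 0 still succeeds for a zombie, so if the monitor
is also the parent it has to reap the WP before the pid reads as
gone.
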
Romain Kang                             Disclaimer: I speak for myself alone,
romain at kzsu.stanford.edu                except when indicated otherwise.