Help needed with hanging bash script

Matthew J. Roth mroth at imminc.com
Mon Jun 25 21:24:31 UTC 2007


Bash gurus,

I have a bash script that monitors a directory for files.  Whenever it 
finds files in this directory, it passes them to a support script for 
processing.  The support script moves the files to another directory 
prior to processing them, and it is run in the background to prevent 
blocking the main script.  A simplified version of the main script loop 
follows:

  # Execute once every 10 seconds
  while true;
  do
     # Fork a background script to process each file in the spool directory
     for fname in `ls /spool/dir/*.ext 2> /dev/null`
     do
        bname=`basename $fname`

        bg_script $bname &
     done

     sleep 10
  done

This is pretty simple and it worked flawlessly for over a year on a 
dual-processor server running Fedora Core 3.  However, after upgrading 
to an 8-core (2 CPUs x 4 cores) server running Fedora Core 6, the script 
hangs a few times a week.  This is a bad thing, so I have to keep a 
close eye on the server until the bug is resolved.

The process tree of the script when it's hanging follows:

  [root@server ~]# ps axjf
   PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
      1  3512  3510  2302 ?           -1 S        0   0:59 /bin/bash /usr/local/bin/script
   3512 21432  3510  2302 ?           -1 R        0  40:50  \_ /bin/bash /usr/local/bin/script

Note that the parent process (PID 3512) is sleeping and has accumulated 
relatively little CPU time since boot.  The child process (PID 21432) is 
running in a hard loop and top shows that it is consuming 100% of one of 
the cores.  It also never terminates, so it permanently blocks the 
parent process.  If the child process is killed, the parent process 
resumes execution without any problems.

The interesting thing is that the script never calls itself.  It only 
calls the support script as a background job.  I'm not an expert on the 
inner workings of bash, but I believe that the child process is a 
temporary artifact of the fork-exec call sequence used to run the 
commands in the parent.  It seems that a copy of the existing process is 
created by the fork, but it is never replaced by the command it was 
meant to exec.
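
To illustrate the fork step described above, here is a minimal sketch 
(not taken from the real script) showing that bash runs both command 
substitutions and "&" jobs in a forked copy of itself.  It assumes bash 
4.0 or later for the BASHPID variable, which is newer than the bash 
shipped with Fedora Core 6, and the function name show_subshell_pids is 
made up for the example.

```shell
#!/bin/bash
# Sketch only: show that bash forks a copy of itself for command
# substitutions and background jobs.
# Assumption: bash 4.0+ (BASHPID does not exist in older versions).

show_subshell_pids() {
    local subst_pid bg_pid tmpfile

    # $$ always reports the PID of the original shell, even in a subshell.
    echo "main shell:           $$"

    # Command substitution: the builtin echo runs in a forked subshell,
    # so BASHPID here is the PID of that forked copy.
    subst_pid=$(echo "$BASHPID")
    echo "command substitution: $subst_pid"

    # Background job: also a forked copy of the shell.  The PID is passed
    # back through a temporary file because the job runs asynchronously.
    tmpfile=$(mktemp)
    ( echo "$BASHPID" > "$tmpfile" ) &
    wait $!
    bg_pid=$(cat "$tmpfile")
    rm -f "$tmpfile"
    echo "background job:       $bg_pid"
}

show_subshell_pids
```

The three PIDs printed should all differ.  Until such a forked copy 
calls exec, ps lists it with the same command line as its parent, which 
would be consistent with the second "/bin/bash /usr/local/bin/script" 
entry in the process tree.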

I researched the logs and I'm fairly confident that the script is 
hanging at the top of the for loop, presumably after exhausting the list 
created by the "ls" command.  There is nothing interesting about the 
"ls" command itself, as there are usually fewer than 20 files in the 
directory it's listing.
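
For comparison, the same list can be built without running "ls" in a 
command substitution, which removes one fork from the top of the loop.  
This is only a sketch of one possible workaround, not a confirmed fix 
for the hang, and the function name spool_basenames is hypothetical.

```shell
#!/bin/bash
# Sketch only: print the basenames of the *.ext files in a directory
# using a glob, so the shell does the listing itself instead of forking
# `ls` and `basename`.

spool_basenames() {
    local dir=$1 fname

    for fname in "$dir"/*.ext; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$fname" ] || continue

        # Strip the directory part with parameter expansion.
        printf '%s\n' "${fname##*/}"
    done
}

# In the main loop the same idea would look roughly like this
# (bg_script is the support script from the loop in this post):
#
#   for fname in /spool/dir/*.ext; do
#       [ -e "$fname" ] || continue
#       bg_script "${fname##*/}" &
#   done
```

Quoting "$fname" also keeps file names containing spaces intact, which 
the backtick version would split into separate words.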

I'd appreciate any replies from anyone who has experienced this 
problem.  I have some ideas for working around it, but I'd like to 
actually understand its cause and how to properly resolve it so that I 
don't get stuck on something similar in the future.

Thank you,

Matthew Roth
InterMedia Marketing Solutions
Software Engineer and Systems Developer



