Help needed with hanging bash script

John Wendel john.wendel at metnet.navy.mil
Mon Jun 25 21:42:44 UTC 2007


Matthew J. Roth wrote:
> Bash gurus,
> 
> I have a bash script that monitors a directory for files.  Whenever it 
> finds files in this directory, it passes them to a support script for 
> processing.  The support script moves the files to another directory 
> prior to processing them, and it is run in the background to prevent 
> blocking the main script.  A simplified version of the main script loop 
> follows:
> 
>  # Execute once every 10 seconds
>  while true;
>  do
>     # Fork a background script to process each file in the spool directory
>     for fname in `ls /spool/dir/*.ext 2> /dev/null`
>     do
>        bname=`basename $fname`
> 
>        bg_script $bname &
>     done
> 
>     sleep 10
>  done
> 
> This is pretty simple and it worked flawlessly for over a year on a dual 
> processor server running Fedora Core 3.  However, after upgrading to an 
> 8 core (2 CPUs x 4 cores) server running Fedora Core 6 the script hangs 
> a few times a week.  This is a bad thing, so I have to keep a close eye 
> on the server until the bug is resolved.
> 
> The process tree of the script when it's hanging follows:
> 
>  [root@server ~]# ps axjf
>   PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
>      1  3512  3510  2302 ?           -1 S        0   0:59 /bin/bash /usr/local/bin/script
>   3512 21432  3510  2302 ?           -1 R        0  40:50  \_ /bin/bash /usr/local/bin/script
> 
> Note that the parent process (PID 3512) is sleeping and has accumulated 
> relatively little CPU time since boot.  The child process (PID 21432) is 
> running in a hard loop and top shows that it is consuming 100% of one of 
> the cores.  It also never terminates, so it permanently blocks the 
> parent process.  If the child process is killed, the execution of the 
> parent process restarts without any problems.
> 
> The interesting thing is that the script never calls itself.  It only 
> calls the support script as a background job.  I'm not an expert on the 
> inner workings of bash, but I believe that the child process is a 
> temporary artifact of the fork-exec call sequence used to run the 
> commands in the parent.  It seems that a copy of the existing process is 
> created, but it is never overwritten with the child process.
> 
> I researched the logs and I'm fairly confident that the script is 
> hanging at the top of the for loop, presumably after exhausting the list 
> created by the "ls" command.  There is nothing interesting about the 
> "ls" command itself, as there are usually fewer than 20 files in the 
> directory it's listing.
> 
> I'd appreciate any replies from anyone who has experienced this 
> problem.  I have some ideas for working around it, but I'd like to 
> actually understand its cause and how to properly resolve it so that I 
> don't get stuck on something similar in the future.
> 
> Thank you,
> 
> Matthew Roth
> InterMedia Marketing Solutions
> Software Engineer and Systems Developer
> 
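
On the second "/bin/bash /usr/local/bin/script" entry in your ps listing: bash forks a copy of itself for every command substitution (the backticks around ls and basename in your loop), and until that copy finishes it shows up under the script's own name. The following is only a minimal sketch of that behaviour, not a diagnosis of the hang. It assumes bash 4.0 or later for the BASHPID variable, which is newer than the bash shipped with Fedora Core 6; the variable names are made up for illustration.

```shell
#!/bin/bash
# Sketch: a command substitution runs in a forked copy of the shell.
# $$ always expands to the PID of the original shell, even inside the fork.
# BASHPID (bash 4.0+) expands to the PID of the process actually running.

main_pid=$$

# Both values are expanded inside the forked copy that runs the substitution.
subshell_view=$(echo "$$ $BASHPID")
pid_seen_inside=${subshell_view%% *}   # what $$ reported inside the fork
subshell_pid=${subshell_view##* }      # the fork's real PID

echo "script PID:               $main_pid"
echo "\$\$ inside substitution:   $pid_seen_inside"
echo "PID of the forked copy:   $subshell_pid"
```

The first two numbers should match and the third should differ, which is consistent with a short-lived child of the script appearing in the process tree with the same command line.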


When you see the looping process, run

 > strace -p {looping-pid-number}

This should show which system calls, if any, the process is making 
while it loops. If the problem isn't obvious, post some of the output here.
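
If it helps, here is a rough sketch for catching the looping child and saving a few seconds of strace output to a file, so there is something to post. The function names and the /tmp output path are my own invention, and it assumes a procps-style ps, awk, and timeout(1) from coreutils are installed. You would need to run it as root or as the user that owns the script.

```shell
#!/bin/bash
# Sketch: list the children of a given PID that are currently runnable
# (STAT begins with R). A child spinning in a hard loop, like PID 21432
# in the listing above, should be reported here.
running_children() {
    local parent_pid=$1
    ps -eo pid=,ppid=,stat= |
        awk -v parent="$parent_pid" '$2 == parent && $3 ~ /^R/ { print $1 }'
}

# Attach strace to one PID for a few seconds and save the output to a file.
# -tt prefixes each line with a timestamp, -o writes the trace to a file,
# and timeout(1) stops strace after the given number of seconds.
# The output path is a hypothetical choice; adjust as needed.
trace_for() {
    local pid=$1 seconds=${2:-5}
    local out="/tmp/strace.$pid.txt"
    timeout "$seconds" strace -tt -o "$out" -p "$pid"
    echo "$out"
}
```

With the parent PID from your listing, usage would look something like:

```shell
for pid in $(running_children 3512); do trace_for "$pid"; done
```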

Regards,

John



