Help needed with hanging bash script
John Wendel
john.wendel at metnet.navy.mil
Mon Jun 25 21:42:44 UTC 2007
Matthew J. Roth wrote:
> Bash gurus,
>
> I have a bash script that monitors a directory for files. Whenever it
> finds files in this directory, it passes them to a support script for
> processing. The support script moves the files to another directory
> prior to processing them, and it is run in the background to prevent
> blocking the main script. A simplified version of the main script loop
> follows:
>
> # Execute once every 10 seconds
> while true;
> do
>     # Fork a background script to process each file in the spool directory
>     for fname in `ls /spool/dir/*.ext 2> /dev/null`
>     do
>         bname=`basename $fname`
>
>         bg_script $bname &
>     done
>
>     sleep 10
> done
>
> This is pretty simple and it worked flawlessly for over a year on a
> dual-processor server running Fedora Core 3. However, after upgrading to
> an 8-core (2 CPUs x 4 cores) server running Fedora Core 6 the script hangs
> a few times a week. This is a bad thing, so I have to keep a close eye
> on the server until the bug is resolved.
>
> The process tree of the script when it's hanging follows:
>
> [root@server ~]# ps axjf
>  PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
>     1  3512  3510  2302 ?           -1 S        0   0:59 /bin/bash /usr/local/bin/script
>  3512 21432  3510  2302 ?           -1 R        0  40:50  \_ /bin/bash /usr/local/bin/script
>
> Note that the parent process (PID 3512) is sleeping and has accumulated
> relatively little CPU time since boot. The child process (PID 21432) is
> running in a hard loop and top shows that it is consuming 100% of one of
> the cores. It also never terminates, so it permanently blocks the
> parent process. If the child process is killed, the execution of the
> parent process restarts without any problems.
>
> The interesting thing is that the script never calls itself. It only
> calls the support script as a background job. I'm not an expert on the
> inner workings of bash, but I believe that the child process is a
> temporary artifact of the fork-exec call sequence used to run the
> commands in the parent. It seems that a copy of the existing process is
> created, but it is never overwritten with the child process.
>
> I researched the logs and I'm fairly confident that the script is
> hanging at the top of the for loop, presumably after exhausting the list
> created by the "ls" command. There is nothing interesting about the
> "ls" command itself, as there are usually fewer than 20 files in the
> directory it's listing.
>
> I'd appreciate any replies from anyone who has experienced this
> problem. I have some ideas for working around it, but I'd like to
> actually understand its cause and how to properly resolve it so that I
> don't get stuck on something similar in the future.
>
> Thank you,
>
> Matthew Roth
> InterMedia Marketing Solutions
> Software Engineer and Systems Developer
>
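One observation on the quoted loop: each backtick substitution, and the
"&" job itself, briefly forks a copy of the shell, which is consistent
with a second "/bin/bash /usr/local/bin/script" showing up in the process
tree. I can't say that this is what hangs, but the two substitutions are
easy to avoid. A rough sketch follows, assuming bg_script only needs the
base name; the function name process_spool_once and its two parameters are
made up for illustration and are not part of the original script.

```shell
# Sketch only: the directory and the handler are passed in, so nothing
# here depends on the real /spool/dir or bg_script.
process_spool_once() {
    local spool_dir="$1" handler="$2" fname
    # A glob replaces `ls ...`, so no subshell is forked to build the list.
    for fname in "$spool_dir"/*.ext; do
        # With no matches the pattern is left as a literal string; skip it.
        [ -e "$fname" ] || continue
        # ${fname##*/} strips the directory part like basename, without a fork.
        "$handler" "${fname##*/}" &
    done
}

# The main loop would then look like:
#   while true; do process_spool_once /spool/dir bg_script; sleep 10; done
```

Quoting "$fname" also keeps file names containing spaces in one piece,
which the unquoted backtick version would split.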
When you see the looping process, run

    strace -p {looping-pid-number}

This should reveal something interesting. If the problem isn't
obvious, post some of the output here.
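
If it helps with posting the output, here is one way to save the trace to
a file instead of watching it scroll by. This is only a sketch: the helper
name trace_pid, the default output path, and the STRACE override variable
are my own inventions. The strace options used are -f (follow forks), -tt
(timestamp each call), -o (write to a file), and -p (attach to a PID).

```shell
# Sketch: attach to the looping child and log its system calls to a file.
# trace_pid, the default path, and STRACE are illustrative names only.
trace_pid() {
    local pid="$1" out="${2:-/tmp/strace.$1.out}"
    # STRACE can be set to another command for a dry run; default is strace.
    "${STRACE:-strace}" -f -tt -o "$out" -p "$pid"
}

# Example (runs until interrupted with Ctrl-C):
#   trace_pid 21432
#   tail -n 40 /tmp/strace.21432.out
```

If the file stays empty while the process sits at 100% CPU, the loop is
making no system calls at all, which is worth mentioning too.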
Regards,
John