[Crash-utility] crash sometimes doesn't terminate, loops forever looking for a process that doesn't exist
Dave Anderson
anderson at redhat.com
Mon Nov 7 14:24:53 UTC 2011
----- Original Message -----
> Dear crash-utility,
>
> In our vmcore analysis infrastructure we stumbled on a case where crash
> doesn't terminate. When examining the state of the process with gdb
> it seems to be looping forever over /proc/$pid/stat in an attempt to determine
> the PID of a process that doesn't exist any more.
>
> The backtrace:
> (gdb) bt
> #0 0x00007fb9814fea57 in munmap () from /lib/libc.so.6
> #1 0x00007fb9814a30aa in _IO_setb () from /lib/libc.so.6
> #2 0x00007fb9814a1d18 in _IO_file_close_it () from /lib/libc.so.6
> #3 0x00007fb981495a48 in fclose () from /lib/libc.so.6
> #4 0x00000000004fe75b in output_command_to_pids () at cmdline.c:775
> #5 0x00000000004fed7c in setup_redirect (origin=1) at cmdline.c:519
> #6 0x00000000005012bb in process_command_line () at cmdline.c:149
> #7 0x000000000045f575 in main_loop () at main.c:610
> #8 0x0000000000541af9 in captured_command_loop (data=0x7fb982282000) at ./main.c:226
> #9 0x000000000053fd8b in catch_errors (func=<value optimized out>, func_args=<value optimized out>, errstring=<value optimized out>, mask=<value optimized out>) at exceptions.c:520
> #10 0x00000000005415b6 in captured_main (data=<value optimized out>) at ./main.c:924
> #11 0x000000000053fd8b in catch_errors (func=<value optimized out>,
> func_args=<value optimized out>, errstring=<value optimized out>, mask=<value optimized out>) at exceptions.c:520
> #12 0x0000000000540994 in gdb_main (args=0x1000) at ./main.c:939
> #13 0x00000000005409ce in gdb_main_entry (argc=<value optimized out>, argv=0x1000) at ./main.c:959
> #14 0x000000000046025a in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:525
>
> The problematic code:
> 720 /*
> 721 * Determine the pids of the current popen'd shell and output command.
> 722 * This is all done using /proc; the ps kludge at the bottom of
> this
> 723 * routine is legacy, and should only get executed if /proc doesn't exist.
> 724 */
> 725 static int
> 726 output_command_to_pids(void)
> 727 {
> ...
> 738 int retries;
> 739
> 740 retries = 0;
> 741 pc->pipe_pid = pc->pipe_shell_pid = 0;
> 742 sprintf(lookfor, "(%s)", pc->pipe_command);
> 743 stall(1000);
> 744 retry:
> 745 if (is_directory("/proc") && (dirp = opendir("/proc")))
> {
> 746 for (dp = readdir(dirp); dp && !pc->pipe_pid;
> 747 dp = readdir(dirp)) {
> 748 if (!decimal(dp->d_name, 0))
> 749 continue;
> 750 sprintf(buf1, "/proc/%s/stat",
> dp->d_name);
> 751 if (file_exists(buf1, NULL) &&
> 752 (stp = fopen(buf1, "r"))) {
> 753 if (fgets(buf2, BUFSIZE, stp)) {
> 754 pid = strtok(buf2, " ");
> 755 name = strtok(NULL, "
> ");
> 756 status = strtok(NULL, "
> ");
> 757 p_pid = strtok(NULL, "
> ");
> 758 pgrp = strtok(NULL, "
> ");
> 759 if (STREQ(name, "(sh)")
> &&
> 760 (atoi(p_pid) == getpid()))
> 761
> pc->pipe_shell_pid = atoi(pid);
> 762 if (STREQ(name, lookfor)
> &&
> 763 ((atoi(p_pid) ==
> getpid()) ||
> 764 (atoi(p_pid) ==
> pc->pipe_shell_pid)
> 765 || (atoi(pgrp) ==
> getpid()))) {
> 766 pc->pipe_pid =
> atoi(pid);
> 767 console(
> 768 "FOUND[%d] (%d->%d->%d) %s %s p_pid:
> %s pgrp: %s\n",
> 769 retries,
> getpid(),
> 770
> pc->pipe_shell_pid,
> 771
> pc->pipe_pid,
> 772 name,
> status,
> p_pid, pgrp);
> 773 }
> 774 }
> 775 fclose(stp);
> 776 }
> 777 }
> 778 closedir(dirp);
> 779 }
> 780
> 781 if (!pc->pipe_pid && ((retries++ < 10) || pc->pipe_shell_pid)) {
> 782 stall(1000);
> 783 goto retry;
> 784 }
>
> Looking at how many times it has been looping over /proc:
> (gdb) p retries
> $19 = 138056108
>
> It found the PID of the shell but not of the command:
> (gdb) p pc->pipe_shell_pid
> $20 = 9306
> (gdb) p pc->pipe_pid
> $21 = 0
>
> For completeness the command that was being run was looking like this:
> (gdb) p pc->orig_line
> $26 = "log | grep -A1 'some string' >> /some/file", '\000' [...]
>
> So it seems something like this happened:
> +>popen(grep)
> +--> fork(); execve(sh)
> +---> fork(); execve(grep)
> +----> grep exit()s for some reason
> +>crash(8) finds sh in /proc
> +---> sh exit
> +>crash(8) keeps looking for grep in /proc
>
> I have a second core showing a similar situation if that's of any use but
> now we just work around the problem by wrapping crash(8) within timeout(1).
>
> We could try and fix that function to bail out when the shell exits
> but it really doesn't look like a nice way to do it to me. So I looked
> at the reasons we want the PID of that command and it seems there are
> two:
>
> * determining whether the process is still alive
> This can be done by checking whether the intervening shell is still alive.
> Obtaining only the PID of the shell seems less problematic than trying to
> get the PID of the grandchildren. At worst reimplementing popen()
> to store the PID of sh is not exactly hard.
>
> * terminating the process forcibly (SIGKILL)
> This is done in close_output() which is only called from within restart() when
> its argument is not SIGSEGV, SIGPIPE, SIGINT or 0. I cannot find that function
> being set as a signal handler for anything else or being called with an
> argument different from 0. As far as I can tell this is dead code.
>
> Before I write/test/submit a patch, am I missing something or would it make
> sense to get rid of that pipe_pid?
>
> Thanks,
> Adrien
Hi Adrien,
I would be a little hesitant to get rid of the pc->pipe_pid at this point
in time.
I can't seem to be able to reproduce it, but certainly there should
be an escape valve in output_commands_to_pid() to recognize it and bail
out. But I presume that your piped command sequence actually worked,
and so it would be strange/unnecessary for setup_redirect() to do the
error(FATAL_RESTART, ...) that it currently does when output_commands_to_pid()
returns with a NULL?
Anyway, my point is to try to keep the fix as simple as possible...
Thanks,
Dave
More information about the Crash-utility
mailing list