[Crash-utility] crash endlessly looping on stdout error
Guy Streeter
streeter at redhat.com
Wed Feb 22 17:01:02 UTC 2012
On 02/21/2012 10:44 AM, Dave Anderson wrote:
>
>
> ----- Original Message -----
>> We have a recurring problem in our crash analysis system, where remote users
>> get disconnected and crash starts endlessly looping trying to write to stdout.
>> An strace of a recent instance is looping on:
>>
>> write(1, " JIFFIES\n", 10) = -1 EIO (Input/output error)
>>
>> but that isn't always the output string.
>>
>> This is a problem in our shared environment because the orphaned crash tasks
>> eat up the CPUs, and we don't have the privilege to kill each other's tasks.
>>
>> thanks,
>> --Guy
>
> Hmmm, upon initial glance, this seemed to be related to the crash-5.0.2
> fix that you guys reported:
>
> - Fix to prevent a crash session that is run over a network connection
> that is killed/removed from going into 100% cpu-time loop. Without
> the patch, the behavior of the built-in readline() library call in
> gdb-7.0 has changed such that the function returns when the EOF is
> encountered on /dev/tty, and the crash session goes into an endless
> loop; whereas in gdb-6.1, the readline() call never returns because
> the crash session gets killed while running in the library code.
> (anderson at redhat.com)
>
> But if the orphaned task is repetitively writing the same thing, it
> would never get to the next readline() call, where it would kill
> itself. Taking your example, the "JIFFIES" write() is part of a "timer"
> command, but I'm trying to understand how/why the command is not just
> completing a series of (failed) fprintf's, and then falling into
> the next readline() -- where it should kill itself? By any chance
> was the remote caller doing a "repeat" command on the live system,
> or something like that? (sounds doubtful since you'd have to have
> root privileges to do that...)
>
This is not a live system. This is the setup where we analyze vmcores sent in
by our customers.

I don't understand how it happens either, unless for some reason fprintf is
re-trying the failed write().
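
For reference, a single failed write() just reports the error to its caller, so an endless stream of identical failed writes suggests that something above it keeps re-issuing the call. Below is a minimal sketch of an output loop that gives up on the first fatal error instead. It is in Python for brevity and is not crash's actual code: the names emit_all, make_dead_fd and FATAL_ERRNOS are hypothetical, the set of "fatal" errnos is an assumption, and a pipe with its reader closed (EPIPE) stands in for the disconnected terminal that produced EIO in the strace above.

```python
import errno
import os

# Assumed set of errors meaning "the output channel is gone for good".
FATAL_ERRNOS = {errno.EIO, errno.EPIPE, errno.EBADF}


def emit_all(fd, chunks):
    """Write each chunk to fd, stopping at the first fatal write error.

    Returns (chunks_written, errno_or_None). Short writes are ignored
    here to keep the sketch small.
    """
    written = 0
    for chunk in chunks:
        try:
            os.write(fd, chunk)
        except OSError as exc:
            if exc.errno in FATAL_ERRNOS:
                # Do not retry: report the error and let the caller exit.
                return written, exc.errno
            raise
        written += 1
    return written, None


def make_dead_fd():
    """Return a write fd whose reader is gone (stand-in for a dropped tty)."""
    read_end, write_end = os.pipe()
    os.close(read_end)
    return write_end
```

A caller would check the returned errno and terminate the session rather than go back for more output, e.g. `n, err = emit_all(make_dead_fd(), [b" JIFFIES\n"])` followed by an exit if `err` is not None.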

This is not the only failure scenario. I just saw another one looping on
this sequence:

rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
Perhaps it isn't a crash program issue at all. Maybe it's at the system
library level.
--Guy