[Crash-utility] crash endlessly looping on stdout error

Wed Feb 22 19:39:50 UTC 2012

----- Original Message -----
> On 02/21/2012 10:44 AM, Dave Anderson wrote:
> > 
> > 
> > ----- Original Message -----
> >> We have a recurring problem in our crash analysis system, where remote users
> >> get disconnected and crash starts endlessly looping trying to write to stdout.
> >> An strace of a recent instance is looping on:
> >>
> >> write(1, "  JIFFIES\n", 10)             = -1 EIO (Input/output error)
> >>
> >> but that isn't always the output string.
> >>
> >> This is a problem in our shared environment because the orphaned crash tasks
> >> eat up the CPUs, and we don't have the privilege to kill each other's tasks.
> >>
> >> thanks,
> >> --Guy
> > 
> > Hmmm, upon initial glance, this seemed to be related to the crash-5.0.2
> > fix that you guys reported:
> > 
> >     - Fix to prevent a crash session that is run over a network connection
> >       that is killed/removed from going into 100% cpu-time loop.  Without
> >       the patch, the behavior of the built-in readline() library call in
> >       gdb-7.0 has changed such that the function returns when the EOF is
> >       encountered on /dev/tty, and the crash session goes into an endless
> >       loop; whereas in gdb-6.1, the readline() call never returns because
> >       the crash session gets killed while running in the library code.
> >       (anderson at redhat.com)
> > 
> > But if the orphaned task is repetitively writing the same thing, it
> > would never get to the next readline() call, where it would kill
> > itself.  Taking your example, the "JIFFIES" write() is part of a "timer"
> > command, but I'm trying to understand how/why the command is not just
> > completing a series of (failed) fprintf's, and then falling into
> > the next readline() -- where it should kill itself?  By any chance
> > was the remote caller doing a "repeat" command on the live system,
> > or something like that?  (sounds doubtful since you'd have to have
> > root privileges to do that...)
> > 
> 
> This is not a live system. This is the setup where we analyze vmcores sent in
> by our customers.  I don't understand how it happens either, unless for some reason
> fprintf is re-trying the failed write().  This is not the only failure scenario.
> I just saw another one repeating on this sequence:
> 
> rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
> {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
> rt_sigreturn(0x8)                       = -1 ENETDOWN (Network is down)
> --- SIGFPE (Floating point exception) @ 0 (0) ---
> rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
> {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
> rt_sigreturn(0x8)                       = -1 ENETDOWN (Network is down)
> --- SIGFPE (Floating point exception) @ 0 (0) ---
> rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
> {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
> rt_sigreturn(0x8)                       = -1 ENETDOWN (Network is down)
> --- SIGFPE (Floating point exception) @ 0 (0) ---
> 
> Perhaps it isn't a crash program issue at all. Maybe it's at the
> system library level.

About the closest I can come to reproducing it so far is to run
"kmem -S" on a dumpfile I created with the snap.so extension
module, where the slab subsystem was churning underneath the
snapshot process (a live dump).  Anyway, the command gets into
an endless readmem() loop because of invalid kmem slab bookkeeping
values, and if I kill the network connection I can catch it in a
readmem() loop.  

Now I could check for a parent pid of 1 each time in readmem(),
and kill it there, given readmem() is so regularly called, but 
since you're seeing scenarios that don't show a readmem() in
the loop, that's not going to fly.  Perhaps a better plan would
be to set up prctl(PR_SET_PDEATHSIG, SIGKILL) during initialization,
and hope there are no unwanted side effects.
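
To make the two options concrete, here is a minimal sketch, assuming
Linux and glibc. The helper names arm_parent_death_signal() and
parent_is_init() are made up for illustration and are not existing
crash functions. The first asks the kernel to deliver SIGKILL when the
parent goes away; the second is the "parent pid of 1" test that would
have to be polled from somewhere like readmem().

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, intended to be called once during initialization.
 * Asks the kernel to send SIGKILL to this process when its parent exits.
 * Returns 0 on success, -1 if prctl() fails (errno is set by prctl).
 */
static int arm_parent_death_signal(void)
{
    if (prctl(PR_SET_PDEATHSIG, SIGKILL) == -1)
        return -1;
    return 0;
}

/*
 * Hypothetical helper for the polling alternative.
 * Returns 1 if the process has been reparented to pid 1, else 0.
 * Assumption: orphans are reparented to init; on systems that use a
 * subreaper the new parent pid may be something other than 1.
 */
static int parent_is_init(void)
{
    return getppid() == (pid_t)1;
}
```

The prctl() route needs no polling, so it would also cover loops that
never reach readmem(), but it only fires if the parent actually exits
when the connection drops. If the parent is already gone by the time
prctl() is called, no signal arrives, so a one-time parent_is_init()
check right after arming would be a reasonable belt-and-suspenders step.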

Dave