How do I debug/troubleshoot a crashing system?

Kim Lux lux at diesel-research.com
Fri Jan 28 16:35:59 UTC 2005


I found the problem the other day: I was running ndiswrapper with a
kernel using a 4K stack.  

It would operate fine most of the time, but if the network load got just
right it would silently crash the kernel.  The thing that drove it over
the edge, i.e. made it crash regularly, was when I started doing some
light NAT work with it.  After that I could crash it almost at will.

I found the cause by chance when I installed a new kernel and was
rebuilding ndiswrapper.  I noticed a warning in the build messages
saying that some drivers do not work with a kernel using a 4K stack.

I built a custom kernel with an 8K stack and I've been running
crash-free for 3 days, NAT and all.
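
For anyone who wants to check a kernel before hitting the same problem,
here is a minimal sketch in Python that looks for the 4K stack option in
the text of a kernel config file.  It assumes the option is named
CONFIG_4KSTACKS, as I believe it is on i386 2.6 kernels; the function
name and the sample config fragments are made up for illustration.

```python
def uses_4k_stacks(config_text):
    """Return True if the kernel config text enables 4K stacks.

    Assumption: the option is called CONFIG_4KSTACKS.  When it is
    disabled it shows up as '# CONFIG_4KSTACKS is not set' or is
    missing entirely, so only an exact '=y' line counts as enabled.
    """
    for line in config_text.splitlines():
        if line.strip() == "CONFIG_4KSTACKS=y":
            return True
    return False


# Hypothetical config fragments, for illustration only.
sample_4k = "CONFIG_X86=y\nCONFIG_4KSTACKS=y\n"
sample_8k = "CONFIG_X86=y\n# CONFIG_4KSTACKS is not set\n"

print(uses_4k_stacks(sample_4k))
print(uses_4k_stacks(sample_8k))
```

To check an installed kernel, pass in the contents of its config file,
which many distributions install under /boot as config-<kernel release>.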

If I ran into this sort of thing again, I think I would build a custom
kernel with all of the debugging features turned on and maybe run it
with a serial console to capture the debugger output stream.
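
As a rough sketch of the serial console part, the Python below just
builds the kernel command-line arguments that send console output to a
serial port.  The console= parameter is the standard kernel one, but
the port (ttyS0), the speed (115200) and the function name are my
assumptions; adjust them for the actual hardware.  The capturing side
(a second machine logging the serial line) is not shown.

```python
def serial_console_args(port="ttyS0", baud=115200, keep_vga=True):
    """Build kernel boot arguments for a serial console.

    Assumed defaults: first serial port, 115200 baud, no parity,
    8 data bits.  With keep_vga=True the local screen is listed as a
    console too; the serial port is listed last.
    """
    args = []
    if keep_vga:
        args.append("console=tty0")
    args.append("console=%s,%dn8" % (port, baud))
    return " ".join(args)


# Append the result to the kernel line in the boot loader config.
print(serial_console_args())
```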

On Wed, 2005-01-26 at 18:48 -0700, Wes Shull wrote:
> On Tue, 2005-01-25 at 19:09 -0700, Kim Lux wrote:
> > How do I figure out what is causing the problem?  I've checked the
> > system logs, but they are clean.
> 
> With lots of crashes lately but never an oops or panic message to
> report, I was about to have the same question, but just to be safe I
> left memtest86 running today, and found bad ram :(
> 
> I'm running with mem=236M for now to block out the bad parts, but has
> there been an RFE for the badram kernel patch?  (not seeing any on
> bugzilla, not even closed GOAWAY or BADIDEA or whatever)  We've
> already got a version of memtest86 that can spit out the badram
> values...  Assuming the labor of maintaining it in the patchset isn't
> too high, I think it's probably a better thing to recognize that
> people are going to use imperfect hardware and give them a way to deal
> with it, than to decide that everyone needs new hardware.  (start
> flamewar now)
> 
> http://rick.vanrein.org/linux/badram/
> 
> If that turns out not to be the (only) problem, what *is* the best way
> to get debug info from bad crashes, where even alt-sysrq-jitsu does no
> good?  I know about the serial console capability; lately I've also
> seen stuff about diskdump and netdump...  which of these is most
> likely to survive serious kernel problems long enough to get a useful
> report that can be bugzilla'ed?
> 
-- 
Kim Lux,  Diesel Research Inc.




