Nvidia Signal 11 error

Nifty Hat Mitch mitch48 at sbcglobal.net
Sat Jan 8 12:39:06 UTC 2005


On Wed, Jan 05, 2005 at 07:03:37PM +0200, Chadley Wilson wrote:

> I get quite an odd error with my nvidia mx440se
...
> all my 3d apps run for a while then suddenly terminate, no errors
> reported, logs are empty.  I have run the apps from a terminal, in
> Quake3 Arena and celestia I get this when the apps terminate.
> Received signal 11, exiting...

Signal 11 is simply:

	 SIGSEGV      11       Core    Invalid memory reference
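
If you launch the game from a script instead of by hand, the same
signal shows up in the exit status.  A minimal Python sketch, assuming
a POSIX system where a child killed by signal N is reported as return
code -N; describe_exit is a made-up helper name, not part of any
library:

```python
import signal


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {returncode}"


# On Linux, signal 11 is SIGSEGV.
print(describe_exit(-11))
```

Pass it the returncode from subprocess.run() after the application
falls over.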

If the application runs for a while and then terminates, a few
different things could be going on.

Hardware problem.
	 DRAM, VRAM, or thermal runaway.  Make sure that fans and heat
	 sinks are clean and functional.  Modern chips (CMOS) generate
	 most of their heat when gates change state, so computation- or
	 logic-heavy applications can heat parts up to the point of
	 failure.  Make sure BIOS settings are sane and avoid
	 overclocking.  The symptom is that things run for a while
	 then terminate.  For an AGP board (2x, 4x, 8x), check what
	 rate the BIOS permits and what rate is selected when the
	 driver loads.

Library collision.
	 nVidia 3D libraries and Mesa libraries occupy the same name
	 space.  It is possible for lots of things to work with
	 incorrect libraries involved.  One hint is that the nVidia
	 installer makes noise that the installation has been modified
	 if you reinstall the driver+libs.  In my limited experience
	 nVidia has a library structure that can execute in hardware
	 or in software.  Simple things without race conditions or
	 side effects will run just fine.  When things get busy the
	 mixed-up libraries trip up.  The symptom is that things run
	 for a while then terminate.
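
One way to see which GL libraries a binary actually resolves is to
read the output of ldd.  A rough Python sketch, assuming ldd is on the
PATH and prints lines of the usual "name => path (address)" form; the
function names and the library-name prefixes are my own choices for
illustration:

```python
import subprocess


def gl_libraries(ldd_output: str) -> dict:
    """Map each GL-related library name to the path ldd resolved it to."""
    found = {}
    for line in ldd_output.splitlines():
        # Typical line: "libGL.so.1 => /usr/lib/libGL.so.1 (0x00b5d000)"
        if "=>" not in line:
            continue
        name, _, rest = line.partition("=>")
        name = name.strip()
        # Prefixes are an assumption; "libGL" also matches libGLcore, libGLU.
        if not name.startswith(("libGL", "libnvidia")):
            continue
        rest = rest.strip()
        if rest.startswith("not found"):
            found[name] = "not found"
        else:
            parts = rest.split()
            found[name] = parts[0] if parts else ""
    return found


def check_binary(path: str) -> dict:
    """Run ldd on a binary and report where its GL libraries come from."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=True
    )
    return gl_libraries(result.stdout)
```

Call check_binary() with the full path to the game binary and compare
the reported paths with where the nVidia installer put its libraries.
A path that points at a Mesa copy of libGL would be worth a closer
look.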

Memory leak.
	 Applications and drivers can fail to reuse memory correctly
	 and keep allocating additional memory.  As memory is
	 exhausted bad things can happen, i.e. the program runs for a
	 while and then falls over.  The symptom is that things run
	 for a while then terminate.
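
A crude check for a leak in the application itself is to sample its
resident memory while it runs.  A minimal Python sketch, assuming a
Linux /proc filesystem with a VmRSS line in /proc/<pid>/status; the
helper names and the 1024 kB threshold are arbitrary choices of mine.
It only sees the process's own resident memory, not memory held by a
kernel driver or on the video card:

```python
import time


def parse_vmrss_kb(status_text: str):
    """Extract the VmRSS value (in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def read_rss_kb(pid: int):
    """Return resident memory in kB, or None if the process is gone."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_vmrss_kb(f.read())
    except (FileNotFoundError, ProcessLookupError):
        return None


def steadily_growing(samples, min_growth_kb=1024):
    """True if the samples never shrink and grow by min_growth_kb overall."""
    if len(samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_shrinks and samples[-1] - samples[0] >= min_growth_kb


def watch(pid: int, interval_s: float = 5.0, count: int = 12):
    """Collect up to `count` RSS samples, stopping early if the process exits."""
    samples = []
    for _ in range(count):
        rss = read_rss_kb(pid)
        if rss is None:
            break
        samples.append(rss)
        time.sleep(interval_s)
    return samples
```

Run watch() against the game's pid while playing and feed the result
to steadily_growing().  Steady growth right up to the crash points
toward a leak; flat usage makes the other causes more likely.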

Asynchronous bug.
	 Some events, including interrupts, happen at odd times.  In
	 some code there are race conditions between the validation of
	 a memory block and its use.  This can be in the application,
	 the kernel, or some combination of both.  If I recall, Quake3
	 Arena and company do lots of texture mapping and lots of
	 texture map data transfer, with asynchronous signaling
	 (hardware and software), increased chip heat, and more.  Some
	 of these bugs might be more common on multi-processor
	 systems.  The symptom is that things run for a while then
	 terminate.

Kernel bugs.
	 kernel-2.6.9-1.11_FC2.i686.rpm contains this changelog comment:
	   * Thu Dec 16 2004
	   - Better version of the PCI Posting fixes for agpgart.
	   - Add missing cache flush to the AGP code.
	 So try different kernel versions and tell us which one you
	 are using.  Always update to and test the latest rpm.

Can you check for memory leaks?  Can you compile or run under a
debugger?  Can you run with debugging libraries (symbols)?  Can you
enable a core dump and report the stack trace?
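
For the core dump, the core file size limit has to be raised before
the application starts.  A minimal Python sketch, assuming a POSIX
system; allow_core_dumps and run_with_core are hypothetical names, and
whether and where a core file is actually written still depends on
your system's configuration:

```python
import resource
import subprocess


def allow_core_dumps():
    """Raise the soft core-file size limit to the hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_with_core(argv):
    """Run a command with core dumps allowed and return its exit status."""
    # preexec_fn runs in the child just before exec, so only the
    # launched application gets the raised limit.
    proc = subprocess.run(argv, preexec_fn=allow_core_dumps)
    return proc.returncode
```

A return code of -11 from run_with_core() means the child died on
SIGSEGV.  If a core file was written, load it with
"gdb <program> <corefile>" and type "bt" to get the stack trace.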

-- 
	T o m  M i t c h e l l 
	spam unwanted email.
	SPAM, good eats, and a trademark of  Hormel Foods.



