fork failed: Cannot allocate memory

Mon Feb 23 14:42:00 UTC 2004

I have a Dell 2650: dual Xeon, 2GB RAM, 6GB swap, with a PERC hardware
RAID card. It was running RedHat AS2.1 with the 2.4.9-e.27smp kernel.
Other than the kernel version, the box was fully up2date with patches.

The box doesn't have much running: apache (idle 99.999% of the time),
ntp, snmp, sshd, the netcool object server and some custom perl scripts
(running with NON-ROOT ownership and some persistent "tail" commands
feeding the perl processes).

Every week or two the box stops allowing new netcool client
connections, and ssh connection attempts result in
	ssh fork failed: Cannot allocate memory
If I keep trying every few seconds I can eventually get in, but the
shell is unable to execute the environment (bash_profile) config files.
It gives a bunch of "fork failed: Cannot allocate memory" errors from
bash and then dumps me at a bash prompt.  I can then run shell built-ins
like cd and pwd, and tab-completion works, but I can't run any processes.

I was able to cd /proc, and then use "ls <tab><tab>" to see what pids 
were listed in proc.  I then just crossed my fingers and tried to kill 
one of them...  It kept giving me "fork" errors about allocating memory, 
but I kept trying again and again until it succeeded in killing something.
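
For what it's worth, the PID listing can be taken a step further without
forking anything. Below is a rough sketch, assuming a bash-like shell and
assuming each /proc/<pid>/status file begins with a "Name:" line.
list_procs is a made-up helper name, and its optional directory argument
is only there so it can be tried against a copy of the layout. It avoids
pipes and $(...), since those need a fork as well.

```shell
# list_procs [DIR]: print "PID NAME" for each numeric entry under DIR
# (default /proc), using only shell built-ins so nothing has to fork.
# The helper name and the DIR argument are hypothetical, for illustration.
list_procs() {
    local dir="${1:-/proc}" entry key name
    for entry in "$dir"/[0-9]*; do
        # Skip the unexpanded pattern and processes that exited meanwhile.
        [ -r "$entry/status" ] || continue
        # Assumed format of the first line: "Name:<TAB>sshd"
        read -r key name < "$entry/status" || continue
        printf '%s %s\n' "${entry##*/}" "$name"
    done
}
```

Typing list_procs with no argument at the broken prompt would print one
line per process, which at least shows what a PID belongs to before
killing it.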

Immediately after I killed some process I was able to run top and get
another login!  (hurray!)  But then the same problem started again.
At least this time I had top running.  I could see that the load on the
box was minimal (less than 0.3).  The CPUs were 90%+ idle.  "top"
showed that the swap space (three 2GB swap partitions for a total of
6GB) was almost completely unused; only about 7250KB of swap was in use.
The box has 2GB of RAM, and the free command showed that only about
256MB was being used for processes and process data.  The rest was
being used for cache/buffer, and about 9MB was "free".  There were only
about 50 processes running on the box.  netstat -nap showed only a few
hundred open sockets/ports.
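
The same memory figures can be read straight from /proc/meminfo when
free and top won't start. A minimal sketch, assuming the "Field: value kB"
lines (MemFree, Cached, SwapFree and so on) are present in that file.
meminfo_kb is a hypothetical helper, and its second argument is only for
trying it on a saved copy of the file.

```shell
# meminfo_kb FIELD [FILE]: print the kB value of one "FIELD:  N kB" line
# from FILE (default /proc/meminfo), using only shell built-ins.
# The helper name and the FILE argument are hypothetical, for illustration.
meminfo_kb() {
    local want="$1" file="${2:-/proc/meminfo}" key value unit
    while read -r key value unit; do
        if [ "$key" = "$want:" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    done < "$file"
    return 1    # field not found
}
```

For example, "meminfo_kb MemFree" and "meminfo_kb SwapFree" would print
the two numbers quoted above, in kB, without starting a new process.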

lsof | wc -l showed only 1703 open files and vmstat showed no unusual
numbers; the box was basically idle.
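
A related number, the system-wide file handle count, is available
without lsof. This is only a sketch, assuming /proc/sys/fs/file-nr holds
three numbers on one line with the allocated count first and the maximum
last. What the middle number means differs between kernel versions, so
it is printed unlabeled. file_handles is a hypothetical name, and it is
not the same measurement as counting lsof lines.

```shell
# file_handles [FILE]: print the three counters from FILE
# (default /proc/sys/fs/file-nr) using only shell built-ins.
# The helper name and the FILE argument are hypothetical, for illustration.
file_handles() {
    local file="${1:-/proc/sys/fs/file-nr}" allocated second max
    read -r allocated second max < "$file" || return 1
    printf 'allocated=%s second_field=%s max=%s\n' "$allocated" "$second" "$max"
}
```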

top, sorted by RAM and by time, showed that the Netcool database and a
perl process had both been running for several weeks, but neither was a
CPU hog, and each was using only about 70MB of RAM.

After I killed the httpd and ntpd processes and restarted them, the
problem appears to have gone away.  (I doubt they had anything to do
with it; I was just restarting processes to get the problem to go
away.)

I don't see this problem on any of the dozens of other boxes I have that
run the same OS on the same hardware. I installed the latest 
kernel-2.4.9-e.38 (or 39?)  and rebooted last night. Is there some other 
metric or system parameter I should be looking at? What else would cause 
this problem?  This is really bad.

The only things different on this machine are the netcool procs, the
perl processes and scripts, and the tail processes that run
continuously, but all of these run with non-root privileges.  How could
a non-root process cause this?