fork failed cannot allocate memory
Ben Russo
Ben at muppethouse.com
Mon Feb 23 14:42:00 UTC 2004
I have a Dell 2650, dual Xeon box, 2GB RAM, 6GB swap, with a PERC
hardware RAID card. It was running RedHat AS2.1 with the 2.4.9-e.27smp
kernel.
Other than the kernel version, the box was fully up2date with patches.
The box doesn't have much running: apache (idle 99.999% of the time),
ntp, snmp, sshd, netcool object server and some custom perl scripts
(running with NON-ROOT ownership, with some persistent "tail" commands
feeding the perl processes).
Every week or two the box will stop allowing new netcool client
connections, and ssh connection attempts result in:
    ssh fork failed: Cannot allocate memory
If I keep trying every few seconds I can eventually get in, but the
shell is unable to execute the environment (bash_profile) config files.
It gives a bunch of "fork failed: Cannot allocate memory" errors from
bash and then dumps me at a bash prompt. I can then run shell built-ins
like cd & pwd, and tab-completion works, but I can't run any processes.
I was able to cd /proc, and then use "ls <tab><tab>" to see what pids
were listed in proc. I then just crossed my fingers and tried to kill
one of them... It kept giving me "fork" errors about allocating memory,
but I kept trying again and again until it succeeded in killing something.
Immediately after I killed some process I was able to run top and get
another login! (hurray!) but then the same problem started again.
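As a side note, the blind walk through /proc described above can be done with bash built-ins alone, so that each PID comes with a process name before anything gets killed. This is only a rough sketch: the function name list_procs and its optional directory argument are made up for illustration, and it assumes the first line of /proc/<pid>/status has the usual "Name:<TAB>command" form.

```shell
# List "<pid> <name>" for every process, using only bash built-ins
# (for, read, echo, globbing), so the shell never has to fork.
# list_procs is a hypothetical helper; the optional argument lets it
# read a directory other than /proc.
list_procs() {
    local root="${1:-/proc}" dir key name
    for dir in "$root"/[0-9]*; do
        # The first line of <pid>/status is assumed to look like
        # "Name:<TAB>sshd". A process may exit between the glob and
        # the read, so errors from the redirect are discarded.
        if { read -r key name < "$dir/status"; } 2>/dev/null; then
            echo "${dir##*/} $name"
        fi
    done
}
```

Running list_procs with no argument reads /proc. Since kill is also a bash built-in, a PID picked from the output can be signalled without a fork. Pipes and $(...) do fork, so the output cannot be filtered through grep or captured while the box is in this state.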
At least this time I had top running. I could see that the load on the
box was minimal (less than 0.3). The CPUs were 90%+ idle. "top"
showed that the swap space (three 2GB swap partitions for a total of
6GB) was almost completely unused, only about 7250KB of swap was used.
The box has 2GB of RAM, the free command showed that only about 256MB
were being used for processes and process-data. The rest was being used
for cache/buffer, and about 9MB was "free". There were only about 50
processes running on the box. netstat -nap only showed a few hundred
open socket/ports.
lsof | wc -l showed only 1703 files open, vmstat showed no unusual
numbers, the box was basically idle.
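For the periods when free and top themselves cannot be started, the memory figures quoted above can also be read straight from /proc/meminfo with built-ins. Again just a sketch: show_meminfo is a made-up name, the field list is simply the handful of counters mentioned above, and it assumes the "Key: value kB" lines that /proc/meminfo normally contains.

```shell
# Print a few memory counters from a meminfo-style file using only
# bash built-ins, so it still works when fork() is failing.
# show_meminfo is a hypothetical helper; the optional argument lets it
# read a file other than /proc/meminfo.
show_meminfo() {
    local file="${1:-/proc/meminfo}" key value unit
    while read -r key value unit; do
        case "$key" in
            MemTotal:|MemFree:|Buffers:|Cached:|SwapTotal:|SwapFree:)
                echo "$key $value $unit" ;;
        esac
    done < "$file"
}
```

Calling show_meminfo with no argument reads /proc/meminfo. Lines whose first word is not in the list are skipped.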
Top, sorted by RAM and sorted by Time, showed that the Netcool database
and a perl process had both been running for several weeks, but neither
was a CPU hog, and each was only using about 70MB of RAM.
After I killed the httpd process and the ntpd process and then restarted
them, the problems appear to have gone away. (I doubt they had anything
to do with the problem; I was just restarting processes to get the
problem to go away.)
I don't see this problem on any of the dozens of other boxes I have that
run the same OS on the same hardware. I installed the latest
kernel-2.4.9-e.38 (or 39?) and rebooted last night. Is there some other
metric or system parameter I should be looking at? What else would cause
this problem? This is really bad.
The only thing different on this machine is the netcool procs, perl
processes and scripts and the tail processes that run continuously, but
all of these are running with non-root privileges. How could a non-root
process cause this?