
fork failed cannot allocate memory

I have a Dell 2650, dual Xeon box, 2GB RAM, 6GB swap with PERC Hardware RAID card. It was running RedHat AS2.1 with the 2.4.9-e.27smp kernel.
Other than the kernel version, the box was fully up2date with patches.

The box doesn't have much running: apache (idle 99.999% of the time), ntp, snmp, sshd, the netcool object server, and some custom perl scripts (running with non-root ownership, with some persistent "tail" commands feeding the perl processes).

Every week or two the box will stop allowing new netcool client connections, and ssh connection attempts result in
	ssh fork failed: Cannot allocate memory
If I keep trying every few seconds I can eventually get in, but the shell is unable to execute the environment (bash_profile) config files. It gives a bunch of "fork failed: Cannot allocate memory" errors from bash and then dumps me at a bash prompt. I can then run shell built-ins like cd and pwd, and tab-completion works, but I can't run any processes.

I was able to cd /proc, and then use "ls <tab><tab>" to see what pids were listed in proc. I then just crossed my fingers and tried to kill one of them... It kept giving me "fork" errors about allocating memory, but I kept trying again and again until it succeeded in killing something.
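For anyone stuck at a prompt in the same state, the /proc trick above can be taken a bit further without forking. This is only a minimal sketch, assuming bash and a Linux-style /proc where the first line of each /proc/PID/status is "Name:" followed by the command name. The function name list_procs and its optional directory argument are made up for illustration. It uses only built-ins (for, read, echo, [ ), so it should still run when fork fails, and it shows which command each pid belongs to before you pick one to kill.

```shell
# List "PID name" pairs using only bash built-ins, so no fork is required.
# The optional argument is the directory to scan; it defaults to /proc.
list_procs() {
    local proc_root="${1:-/proc}" dir name
    for dir in "$proc_root"/[0-9]*; do
        # Skip entries that vanished or are unreadable (also covers an unmatched glob).
        [ -r "$dir/status" ] || continue
        # First line of status looks like "Name:<TAB>sshd"; discard the label.
        read -r _ name < "$dir/status" || continue
        echo "${dir##*/} $name"
    done
}
```

Typed or pasted at the prompt, it is called as plain "list_procs". Note that kill is also a bash built-in, so "kill PID" itself should not need a new process either.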

Immediately after I killed some process I was able to run top and get another login! (hurray!) But then the same problem started again. At least this time I had top running. I could see that the load on the box was minimal (less than 0.3). The CPUs were 90%+ idle. "top" showed that the swap space (three 2GB swap partitions for a total of 6GB) was almost completely unused: only about 7250KB of swap was in use. The box has 2GB of RAM, and the free command showed that only about 256MB was being used for processes and process data. The rest was being used for cache/buffer, and about 9MB was "free". There were only about 50 processes running on the box. netstat -nap showed only a few hundred open sockets/ports.
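The same memory figures that top and free report can be read without starting a process. The sketch below is an assumption-laden illustration, not a known fix: it assumes bash and a /proc/meminfo containing "Key: value kB" lines, and the function name meminfo_fields is hypothetical. Lines that do not match a requested key (such as any summary header at the top of the file) are simply skipped.

```shell
# Print selected fields from a meminfo-style file using only bash built-ins.
# Usage: meminfo_fields FILE FIELD...
meminfo_fields() {
    local file="$1" key value rest want
    shift
    while read -r key value rest; do
        for want in "$@"; do
            # Keys in the file carry a trailing colon, e.g. "MemFree:".
            if [ "$key" = "$want:" ]; then
                echo "$want $value"
            fi
        done
    done < "$file"
}
```

For example, "meminfo_fields /proc/meminfo MemTotal MemFree Buffers Cached SwapTotal SwapFree" prints one "name value" pair per line, in the order the fields appear in the file.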

lsof | wc -l showed only 1703 files open, vmstat showed no unusual numbers, the box was basically idle.

top, sorted by RAM and then by time, showed that the Netcool database and a perl process had both been running for several weeks, but neither was a CPU hog, and each was only using about 70MB of RAM.

After I killed the httpd and ntpd processes and restarted them, the problems appear to have gone away. (I doubt they had anything to do with the problem; I was just restarting processes to get it to go away.)

I don't see this problem on any of the dozens of other boxes I have that run the same OS on the same hardware. I installed the latest kernel-2.4.9-e.38 (or 39?) and rebooted last night. Is there some other metric or system parameter I should be looking at? What else would cause this problem? This is really bad.

The only things different on this machine are the netcool procs, the perl processes and scripts, and the tail processes that run continuously, but all of these run with non-root privileges. How could a non-root process cause this?
