[Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS

Jonathan Woytek woytek+ at cmu.edu
Sat Jan 22 02:57:19 UTC 2005


I have been experiencing OOM failures (followed by reboots) on a cluster 
of Dell PowerEdge 1860s (dual-processor, 4GB RAM) running RHEL3-AS with 
all current updates.

The system is configured as a two-member cluster, running GFS 6.0.2-25 
(RH SRPM) and cluster services 1.2.16-1 (also RH SRPM).  My original 
testing of the cluster went fine, including service fail-over and all 
that stuff (there is only one lock_gulmd, so if the master goes down 
the world explodes, but I expected that).

Usage seemed to be okay, but there weren't a whole lot of users. 
Recently, a project wanted to serve some data from their space in GFS 
via their own machine.  They mount their space from the cluster via 
NFS, and they serve that data via Samba from their machine.  Shortly 
thereafter, two things happened: more people started to access the 
data, and the cluster machines started to crash.  The symptoms are that 
free memory drops extremely quickly (sometimes more than 3GB disappears 
in less than two minutes).  Load average usually goes up quickly (when I 
can see it).  NFS processes are normally at the top of top, along with 
kswapd.  Around this time, the kernel starts to spit out OOM messages 
and kills bunches of processes.  The machine eventually reboots itself 
and comes back up cleanly.
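
For reference, here is a rough sketch of the sort of monitoring I can 
run to catch this in the act (the field names assume the usual 
/proc/meminfo layout on these 2.4 boxes; adjust to taste):

#!/usr/bin/env python
# Log a few /proc/meminfo fields at a fixed interval so the rate and
# destination of the memory drop is visible after the fact.
import time

FIELDS = ("MemFree", "Buffers", "Cached", "LowFree", "SwapFree")
INTERVAL = 60  # seconds between samples

def sample():
    values = {}
    for line in open("/proc/meminfo"):
        parts = line.split(":")
        if len(parts) == 2 and parts[0] in FIELDS:
            values[parts[0]] = parts[1].strip()
    return values

log = open("/var/tmp/meminfo.log", "a")
while 1:
    vals = sample()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    log.write(stamp + " " +
              " ".join(["%s=%s" % (k, vals.get(k, "?")) for k in FIELDS]) +
              "\n")
    log.flush()
    time.sleep(INTERVAL)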

The spacing of the outages seems to depend on how many people are using 
the system, but I've also seen a machine go down when the backup system 
runs a few backups on it.  One thing I've noticed, though, is that the 
backup system doesn't actually cause the machine to crash if the system 
has been recently rebooted, and memory usage returns to normal after 
the backup is finished.  Memory usage usually does NOT return to 
completely normal after the gigabytes of memory get used up; when that 
happens, the machine will sit there and keep running for a while with 
only 20MB or less free, until something presumably tries to use that 
memory and the machine flips out.  That is the only time I've seen the 
backup system crash the machine: after it has endured significant usage 
during the day and there is 20MB or less free.
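
Since memory never seems to come all the way back, something like the 
sketch below could show whether it is sitting in kernel slab caches 
(dentries, inodes, NFS/GFS structures) rather than in any one process. 
It assumes the 2.4-style /proc/slabinfo column layout (name, active 
objs, total objs, object size, ...), so treat it as a rough sketch:

#!/usr/bin/env python
# Print the 15 biggest kernel slab caches by approximate size.
caches = []
for line in open("/proc/slabinfo"):
    parts = line.split()
    if len(parts) < 4 or not parts[1].isdigit():
        continue  # skip the version header and anything unexpected
    name, total_objs, objsize = parts[0], int(parts[2]), int(parts[3])
    caches.append((total_objs * objsize, name))
caches.sort()
caches.reverse()
for size, name in caches[:15]:
    print("%10d kB  %s" % (size / 1024, name))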

I'll usually get a call from the culprits telling me that they were 
copying either a) lots of files or b) large files to the cluster.

Any ideas here?  Anything I can look at to tune?

jonathan



