[Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS

Jonathan Woytek woytek+ at cmu.edu
Mon Jan 24 04:27:52 UTC 2005


Even more additional information:

I've been monitoring the system through a few crashes now, and it looks 
like what is actually running out is "lowmem".  The system eats about 
130-140 kB of it every two seconds.  It is NOT actually plowing through 
3GB+ of memory--highmem does not seem to drop.
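
As a rough illustration, here is a minimal sketch of the kind of sampling 
loop that shows this, assuming a RHEL3-style /proc/meminfo with a 
"LowFree:" line in kB (the script name and the two-second interval are 
only illustrative):

    # lowmem_watch.py -- hypothetical helper: sample LowFree from
    # /proc/meminfo and print how much disappears between samples.
    import time

    def lowfree_kb():
        f = open('/proc/meminfo')
        try:
            for line in f:
                if line.startswith('LowFree:'):
                    # e.g. "LowFree:      20344 kB" -> 20344
                    return int(line.split()[1])
        finally:
            f.close()
        return None

    prev = lowfree_kb()
    while True:
        time.sleep(2)
        cur = lowfree_kb()
        if prev is not None and cur is not None:
            print('%s  LowFree: %7d kB  (delta %+d kB)'
                  % (time.strftime('%H:%M:%S'), cur, cur - prev))
        prev = cur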

Whee fun.

jonathan


Jonathan Woytek wrote:

> Additional information:
> 
> I enabled full output on lock_gulmd, since my dead top sessions would 
> often show that process near the top of the list around the time of 
> crashes.  The machine was rebooted around 10:50AM, and was down again at 
> 12:44.  In the span of less than a minute, the machine plowed through 
> over 3GB of memory and crashed.  The extra debugging information from 
> lock_gulmd said nothing, except that there was a successful heartbeat. 
> The OOM messages began at 12:44:01, and the machine was dead somewhere 
> around 12:44:40.  Nobody should have been using the machine at that time.  A 
> cron job that was scheduled to fire off at 12:44 (it runs every two 
> minutes to check memory usage, specifically to try to track this 
> problem) did not run (or at least was not logged if it did).  I took 
> that job out of cron just to make sure that it isn't part of the 
> problem.  The low-memory-check that ran at 12:42 reported nothing, and 
> my threshold for that is set at 512MB.
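> 
> A minimal sketch of what a check like that might look like, assuming it 
> reads MemFree from /proc/meminfo and complains via syslog below the 
> 512MB threshold (the script name, field, and logging are assumptions, 
> not the actual cron job):
> 
>     # free_check.py -- hypothetical two-minute cron check: warn via
>     # syslog if MemFree in /proc/meminfo drops below 512MB.
>     import syslog
> 
>     THRESHOLD_KB = 512 * 1024
> 
>     def meminfo_kb(field):
>         f = open('/proc/meminfo')
>         try:
>             for line in f:
>                 if line.startswith(field + ':'):
>                     return int(line.split()[1])
>         finally:
>             f.close()
>         return None
> 
>     free = meminfo_kb('MemFree')
>     if free is not None and free < THRESHOLD_KB:
>         syslog.syslog(syslog.LOG_WARNING, 'low memory: %d kB free' % free)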
> 
> The span between crashes this weekend has been between three and eight 
> hours.  Yesterday, the machine rebooted (going by lastlog, not the last 
> message before restart in /var/log/messages, though I'll be looking at 
> that in a bit) at 15:20 (after being up since 23:50 on Friday), then at 
> 18:27 and 21:43, and on Sunday at 01:14, 04:33, and finally 12:48. 
> Something seems quite 
> wrong with this.
> 
> jonathan
> 
> 
> Jonathan Woytek wrote:
> 
>> I have been experiencing OOM failures (followed by reboots) on a 
>> cluster running Dell PowerEdge 1860s (dual-proc, 4GB RAM) with 
>> RHEL3-AS with all current updates.
>>
>> The system is configured as a two-member cluster, running GFS 6.0.2-25 
>> (RH SRPM) and cluster services 1.2.16-1 (also RH SRPM).  My original 
>> testing went fine with the cluster, including service fail-over and 
>> all that stuff (only one lock_gulmd, so if the master goes down, the 
>> world explodes--but I expected that).
>>
>> Use seemed to be okay, but there weren't a whole lot of users. 
>> Recently, a project wanted to serve some data from their space in GFS 
>> via their own machine.  We mounted their space via NFS from the 
>> cluster, and they serve their data via Samba from their machine.  
>> Shortly thereafter, two things happened:  more people started to 
>> access the data, and the cluster machines started to crash.  The 
>> symptoms are that free memory drops extremely quickly (sometimes more 
>> than 3GB disappears in less than two minutes).  Load average usually 
>> goes up quickly (when I can see it).  NFS processes are normally at 
>> the top of top, along with kswapd.  At some point around then, the 
>> kernel starts to spit out OOM messages and kills off processes in 
>> bunches.  The machine eventually reboots itself and comes 
>> back up cleanly.
>>
>> The spacing of outages seems to depend on how many people are using 
>> the system, but I've also seen the machine go down when the backup 
>> system runs a few backups on the machine.  One of the things I've 
>> noticed, though, is that the backup system doesn't actually cause the 
>> machine to crash if the system has been recently rebooted, and memory 
>> usage returns to normal after the backup is finished.  Memory usage 
>> usually does NOT return to completely normal after those gigabytes of 
>> memory get eaten (when that happens, the machine will sit there and 
>> keep running for a while with only 20MB or less free, until something 
>> presumably tries to use that memory and the machine flips out).  That 
>> is the only time I've seen the backup system cause the system to 
>> crash--after it has endured significant usage during the day and there 
>> are 20MB or less free.
>>
>> I'll usually get a call from the culprits telling me that they were 
>> copying either a) lots of files or b) large files to the cluster.
>>
>> Any ideas here?  Anything I can look at to tune?
>>
>> jonathan
>>



