[Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS
Jonathan Woytek
woytek+ at cmu.edu
Mon Jan 24 04:27:52 UTC 2005
Even more additional information:
I've been monitoring the system through a few crashes now, and it looks
like what is actually running out of memory is "lowmem". The system
seems to eat about 130-140kB every two seconds. It seems that the
system is NOT actually plowing through 3GB+ of memory--highmem does not
seem to drop.
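
For anyone who wants to watch the same thing, here is a minimal sketch of a sampler that reads the LowFree and HighFree fields from /proc/meminfo and projects how long lowmem would last at the current rate of decline. It assumes a highmem-enabled kernel that exposes those two fields and a Python 3 interpreter. The function names (parse_meminfo, seconds_until_empty, watch) are made up for illustration and are not part of any existing tool.

```python
import time


def parse_meminfo(text):
    """Return {field: kB} for lines shaped like 'Name:   12345 kB'."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the 2.4-style byte-count summary lines and anything malformed.
        if (len(parts) == 3 and parts[0].endswith(":")
                and parts[1].isdigit() and parts[2] == "kB"):
            fields[parts[0][:-1]] = int(parts[1])
    return fields


def seconds_until_empty(low_free_kb, drop_kb, interval_s):
    """Seconds until LowFree reaches zero if it keeps falling by drop_kb
    every interval_s seconds; None if it is not falling."""
    if drop_kb <= 0:
        return None
    return low_free_kb / (drop_kb / interval_s)


def watch(path="/proc/meminfo", interval_s=2):
    """Print LowFree/HighFree every interval_s seconds (runs until killed)."""
    prev_low = None
    while True:
        with open(path) as f:
            info = parse_meminfo(f.read())
        low, high = info.get("LowFree"), info.get("HighFree")
        if low is None:
            raise SystemExit("no LowFree field (kernel without highmem?)")
        line = "LowFree=%d kB HighFree=%s kB" % (low, high)
        if prev_low is not None:
            eta = seconds_until_empty(low, prev_low - low, interval_s)
            if eta is not None:
                line += " (lowmem gone in ~%.0f s at this rate)" % eta
        print(line)
        prev_low = low
        time.sleep(interval_s)
```

Calling watch() prints one line per sample. As a purely illustrative figure, 512MB of free lowmem draining at 140kB every two seconds (70kB/s) would last roughly two hours.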
Whee fun.
jonathan
Jonathan Woytek wrote:
> Additional information:
>
> I enabled full output on lock_gulmd, since my dead top sessions would
> often show that process near the top of the list around the time of
> crashes. The machine was rebooted around 10:50AM, and was down again at
> 12:44. In the span of less than a minute, the machine plowed through
> over 3GB of memory and crashed. The extra debugging information from
> lock_gulmd said nothing, except that there was a successful heartbeat.
> The OOM messages began at 12:44:01, and the machine was dead somewhere
> around 12:44:40. Nobody should be using the machine during this time. A
> cron job that was scheduled to fire off at 12:44 (it runs every two
> minutes to check memory usage, specifically to try to track this
> problem) did not run (or at least was not logged if it did). I took
> that job out of cron just to make sure that it isn't part of the
> problem. The low-memory-check that ran at 12:42 reported nothing, and
> my threshold for that is set at 512MB.
>
> The span between crashes this weekend has been between three and eight
> hours. Yesterday, the machine rebooted (looking at lastlog, not last
> message before restart in /var/log/messages, but I'll be looking at that
> in a bit) at 15:20 (after being up since 23:50 on Friday), 18:27, 21:43,
> onto Sunday at 01:14, 04:33, and finally 12:48. Something seems quite
> wrong with this.
>
> jonathan
>
>
> Jonathan Woytek wrote:
>
>> I have been experiencing OOM failures (followed by reboots) on a
>> cluster running Dell PowerEdge 1860s (dual-proc, 4GB RAM) with
>> RHEL3-AS with all current updates.
>>
>> The system is configured as a two-member cluster, running GFS 6.0.2-25
>> (RH SRPM) and cluster services 1.2.16-1 (also RH SRPM). My original
>> testing went fine with the cluster, including service fail-over and
>> all that stuff (only one lock_gulmd, so if the master goes down, the
>> world explodes--but I expected that).
>>
>> Use seemed to be okay, but there weren't a whole lot of users.
>> Recently, a project wanted to serve some data from their space in GFS
>> via their own machine. We mounted their space via NFS from the
>> cluster, and they serve their data via samba from their machine.
>> Shortly thereafter, two things happened: more people started to
>> access the data, and the cluster machines started to crash. The
>> symptoms are that free memory drops extremely quickly (sometimes more
>> than 3GB disappears in less than two minutes). Load average usually
>> goes up quickly (when I can see it). NFS processes are normally at
>> the top of top, along with kswapd. At some point, around this time,
>> the kernel starts to spit out OOM messages and it starts to kill
>> bunches of processes. The machine eventually reboots itself and comes
>> back up cleanly.
>>
>> The spacing of outages seems to depend on how many people are using
>> the system, but I've also seen the machine go down when the backup
>> system runs a few backups on the machine. One of the things I've
>> noticed, though, is that the backup system doesn't actually cause the
>> machine to crash if the system has been recently rebooted, and memory
>> usage returns to normal after the backup is finished. Memory usage
>> usually does NOT return to completely normal after the gigabytes of
>> memory become used (when that happens, the machine will sit there and
>> keep running for a while with only 20MB or less free, until something
>> presumably tries to use that memory and the machine flips out). That
>> is the only time I've seen the backup system cause the system to
>> crash--after it has endured significant usage during the day and there
>> are 20MB or less free.
>>
>> I'll usually get a call from the culprits telling me that they were
>> copying either a) lots of files or b) large files to the cluster.
>>
>> Any ideas here? Anything I can look at to tune?
>>
>> jonathan
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> http://www.redhat.com/mailman/listinfo/linux-cluster