[Linux-cluster] OOM failures with GFS, NFS and Samba on a cluster with RHEL3-AS

Jonathan Woytek woytek+ at cmu.edu
Mon Jan 24 21:37:54 UTC 2005


Yet more and more info:

Jan 24 16:17:00 quicksilver kernel: Mem-info:
Jan 24 16:17:00 quicksilver kernel: Zone:DMA freepages:  2835 min:     0 low:     0 high:     0
Jan 24 16:17:00 quicksilver kernel: Zone:Normal freepages:  1034 min:  1279 low:  4544 high:  6304
Jan 24 16:17:00 quicksilver kernel: Zone:HighMem freepages:759901 min:   255 low: 15872 high: 23808
Jan 24 16:17:00 quicksilver kernel: Free pages:      763768 (759901 HighMem)
Jan 24 16:17:00 quicksilver kernel: ( Active: 22610/25584, inactive_laundry: 3922, inactive_clean: 3890, free: 763768 )
Jan 24 16:17:00 quicksilver kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2835
Jan 24 16:17:00 quicksilver kernel:   aa:0 ac:27 id:0 il:115 ic:0 fr:1026
Jan 24 16:17:00 quicksilver kernel:   aa:12742 ac:9847 id:25584 il:3807 ic:3890 fr:759901
Jan 24 16:17:00 quicksilver kernel: 1*4kB 1*8kB 0*16kB 0*32kB 1*64kB 0*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11340kB)
Jan 24 16:17:00 quicksilver kernel: 272*4kB 19*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3784kB)
Jan 24 16:17:01 quicksilver kernel: 43*4kB 17*8kB 2*16kB 7*32kB 1*64kB 78*128kB 138*256kB 89*512kB 83*1024kB 32*2048kB 683*4096kB = 3039604kB)
Jan 24 16:17:01 quicksilver kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Jan 24 16:17:01 quicksilver kernel: 197629 pages of slabcache
Jan 24 16:17:01 quicksilver kernel: 328 pages of kernel stacks
Jan 24 16:17:01 quicksilver kernel: 0 lowmem pagetables, 529 highmem pagetables
Jan 24 16:17:01 quicksilver kernel: Free swap:       2096472kB
Jan 24 16:17:01 quicksilver kernel: 1245184 pages of RAM
Jan 24 16:17:01 quicksilver kernel: 819136 pages of HIGHMEM
Jan 24 16:17:01 quicksilver kernel: 222298 reserved pages
Jan 24 16:17:01 quicksilver kernel: 38487 pages shared
Jan 24 16:17:01 quicksilver kernel: 0 pages swap cached
Jan 24 16:17:01 quicksilver kernel: Out of Memory: Killed process 2441 (sendmail).
Jan 24 16:17:01 quicksilver kernel: Out of Memory: Killed process 2441 (sendmail).
Jan 24 16:17:01 quicksilver kernel: Fixed up OOM kill of mm-less task


The machine reports OOM kills for about 15-30 seconds before clumembd 
gets killed and the machine reboots.

The OOM kills usually begin at the top of the minute, though that is 
probably just a coincidence.
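
Something along these lines should capture what lowmem and the slab caches 
look like in the run-up to the kills (a rough sketch; it only samples the 
standard /proc files every ten seconds):

   #!/bin/sh
   # log LowFree plus the largest slab caches until the box dies
   # (2.4 /proc/slabinfo columns: name active_objs total_objs objsize ...)
   while true; do
       date
       grep -E 'LowTotal|LowFree' /proc/meminfo
       sort -rn -k3 /proc/slabinfo | head -10
       echo ----
       sleep 10
   done >> /var/tmp/lowmem-watch.log 2>&1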

jonathan


Jonathan Woytek wrote:
> /proc/meminfo:
>         total:    used:    free:  shared: buffers:  cached:
> Mem:  4189741056 925650944 3264090112        0 18685952 76009472
> Swap: 2146787328        0 2146787328
> MemTotal:      4091544 kB
> MemFree:       3187588 kB
> MemShared:           0 kB
> Buffers:         18248 kB
> Cached:          74228 kB
> SwapCached:          0 kB
> Active:         107232 kB
> ActiveAnon:      50084 kB
> ActiveCache:     57148 kB
> Inact_dirty:      1892 kB
> Inact_laundry:   16276 kB
> Inact_clean:     16616 kB
> Inact_target:    28400 kB
> HighTotal:     3276544 kB
> HighFree:      3164096 kB
> LowTotal:       815000 kB
> LowFree:         23492 kB
> SwapTotal:     2096472 kB
> SwapFree:      2096472 kB
> Committed_AS:    72244 kB
> HugePages_Total:     0
> HugePages_Free:      0
> Hugepagesize:     2048 kB
> 
> When a bunch of locks become free, lowmem seems to recover somewhat. 
> However, shutting down lock_gulmd entirely does NOT return lowmem to 
> what it probably should be (though I'm not sure if the system is just 
> keeping all of that memory cached until something else needs it or not).
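> 
> Something like this should show whether the missing lowmem is just sitting 
> in slab (a rough sketch; it prints an approximate kB figure per cache from 
> the 2.4-style /proc/slabinfo):
> 
>   grep -E 'LowTotal|LowFree' /proc/meminfo
>   # slabinfo fields: $1=name $2=active_objs $3=total_objs $4=objsize
>   awk 'NF >= 4 && $3 + 0 > 0 { printf "%-24s %8d kB\n", $1, int($3 * $4 / 1024) }' \
>       /proc/slabinfo | sort -rn -k2 | head -15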
> 
> jonathan
> 
> Jonathan Woytek wrote:
> 
>> Michael Conrad Tadpol Tilstra wrote:
>>
>>> On Sun, Jan 23, 2005 at 01:45:28PM -0500, Jonathan Woytek wrote:
>>>
>>>> Additional information:
>>>>
>>>> I enabled full output on lock_gulmd, since my dead top sessions 
>>>> would often show that process near the top of the list around the 
>>>> time of crashes.  The machine was rebooted around 10:50AM, and was 
>>>> down again at 
>>>
>>>
>>>
>>>
>>> Not surprising that lock_gulmd is working hard when gfs is under heavy
>>> use.  It is busy processing all those lock requests.  What would be
>>> more useful from gulm here than the logging messages is to query the
>>> lock table every so often for its stats:
>>> `gulm_tool getstats <master>:lt000`
>>> The 'locks = ###' line is how many lock structures are currently held.
>>> gulm is very greedy about memory, and you are running the lock servers
>>> on the same nodes you're mounting from.
>>
>>
>>
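>> Something like this in a loop would give a running history of those lock
>> counts (rough sketch; MASTER is an assumption, substitute whichever node
>> getstats reports as I_am = Master):
>>
>>   MASTER=quicksilver   # assumption: adjust to the current gulm master
>>   while true; do
>>       date
>>       gulm_tool getstats $MASTER:lt000 | egrep '^(locks|free_locks|free_holders|used_holders) '
>>       sleep 60
>>   done
>>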
>> Here are the stats from the master lock_gulmd lt000:
>>
>> I_am = Master
>> run time = 9436
>> pid = 2205
>> verbosity = Default
>> id = 0
>> partitions = 1
>> out_queue = 0
>> drpb_queue = 0
>> locks = 20356
>> unlocked = 17651
>> exclusive = 15
>> shared = 2690
>> deferred = 0
>> lvbs = 17661
>> expired = 0
>> lock ops = 107354
>> conflicts = 0
>> incomming_queue = 0
>> conflict_queue = 0
>> reply_queue = 0
>> free_locks = 69644
>> free_lkrqs = 60
>> used_lkrqs = 0
>> free_holders = 109634
>> used_holders = 20366
>> highwater = 1048576
>>
>>
>> Something keeps eating away at lowmem, though, and I still can't 
>> figure out what exactly it is.
>>
>>
>>> also, just to see if I read the first post right, you have
>>> samba->nfs->gfs?
>>
>>
>>
>> If I understand your arrows correctly, I have a filesystem mounted 
>> with GFS that I'm sharing via NFS to another machine that is sharing 
>> it via Samba.  I've closed that link, though, to try to eliminate that 
>> as a problem.  So now I'm serving the GFS filesystem directly through 
>> Samba.
>>
>> jonathan
>>
> 

-- 
Jonathan Woytek                 w: 412-681-3463         woytek+ at cmu.edu
NREC Computing Manager          c: 412-401-1627         KB3HOZ
PGP Key available upon request



