RHEL4 Sun Java Messaging Server deadlock (was: redhat-list Digest, Vol 84, Issue 3)

(Imed Chihi) عماد الشيحي imed.chihi at gmail.com
Fri Feb 4 19:47:27 UTC 2011


> Date: Wed, 02 Feb 2011 13:14:02 -0500
> From: John Dalbec <jpdalbec at ysu.edu>
> To: redhat-list at redhat.com
> Subject: Re: RHEL4 Sun Java Messaging Server deadlock
> Message-ID: <4D499EEA.5000002 at ysu.edu>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 1/6/2011 12:00 PM, redhat-list-request at redhat.com wrote:
>> Message: 3
>> Date: Wed, 5 Jan 2011 22:37:29 +0100
>> From: (Imed Chihi) عماد الشيحي <imed.chihi at gmail.com>
>> To:redhat-list at redhat.com
>> Subject: Re: RHEL4 Sun Java Messaging Server deadlock (was:
>>       redhat-list     Digest, Vol 83, Issue 3)
>>> >  Date: Tue, 04 Jan 2011 12:46:47 -0500
>>> >  From: John Dalbec<jpdalbec at ysu.edu>
>>> >  To:redhat-list at redhat.com
>>> >  Subject: Re: redhat-list Digest, Vol 83, Issue 2
>>> >
>>> >  I got Alt+SysRq+t output, but it looks corrupted in /var/log/messages.
>>> >  I suspect that syslogd couldn't keep up with klogd and the ring buffer
>>> >  wrapped.  The system is not starved for CPU, but then I have 24 cores.
>>> >  If 32-bit + >16GB is trouble then why does the kernel-hugemem package
>>> >  even exist?  Or is that actually a 64-bit kernel?
>> The kernel-hugemem is a 32-bit-only kernel.  When RHEL 4 was released
>> (around February 2005), 64-bit systems were not that ubiquitous.  It
>> still made sense to use a 32-bit OS with the typical 8 GB servers.
>>
>> The context has changed now, and Red Hat no longer supports more than
>> 16 GB on 32-bit platforms as of RHEL 5.
>>
>> The output from Alt+SysRq+m should always fit in a kernel log buffer.
>> Try collecting that one to be certain that the issue is related to VM
>> management.  However, I'd seriously suggest moving to RHEL 4 for
>> x86_64 as you should still be able to run your 32-bit application,
>> minus a whole class of VM hassles.
>>
>>   -Imed
>
> The SysRq-m output had some per-cpu stuff that I don't trust because
> every core was reporting the same numbers.  The system-wide information
> follows.  Under DMA it says "all_unreclaimable? yes" but there appears
> to be plenty of free space.  Do you see any problems?
> Thanks,
> John
>
> Free pages:    26020480kB (25984832kB HighMem)
> Active:719264 inactive:873204 dirty:2982 writeback:0 unstable:0
> free:6505120 slab:198687 mapped:367369 pagetables:6741
> DMA free:12528kB min:32kB low:64kB high:96kB active:0kB inactive:0kB
> present:16384kB pages_scanned:0 all_unreclaimable? yes
> protections[]: 0 0 0
> Normal free:23120kB min:7976kB low:15952kB high:23928kB active:443852kB
> inactive:505268kB present:4014080kB pages_scanned:0 all_unreclaimable? no
> protections[]: 0 0 0
> HighMem free:25984832kB min:512kB low:1024kB high:1536kB
> active:2433204kB inactive:2987548kB present:31621120kB pages_scanned:0
> all_unreclaimable? no
> protections[]: 0 0 0
> DMA: 4*4kB 6*8kB 3*16kB 2*32kB 3*64kB 3*128kB 2*256kB 0*512kB 1*1024kB
> 1*2048kB 2*4096kB = 12528kB
> Normal: 1982*4kB 867*8kB 22*16kB 7*32kB 4*64kB 2*128kB 2*256kB 1*512kB
> 2*1024kB 2*2048kB 0*4096kB = 23120kB
> HighMem: 6*4kB 8975*8kB 22157*16kB 25695*32kB 44932*64kB 21426*128kB
> 9300*256kB 4706*512kB 3330*1024kB 1529*2048kB 1901*4096kB = 25984832kB
> 1293500 pagecache pages
> Swap cache: add 0, delete 0, find 0/0, race 0+0
> 0 bounce buffer pages
> Free swap:       16777208kB
> 8912896 pages of RAM
> 7864320 pages of HIGHMEM
> 597583 reserved pages
> 719482 pages shared
> 0 pages swap cached
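
As an aside, since the earlier SysRq-t dump came out corrupted in
/var/log/messages, you can capture SysRq output straight from the
kernel ring buffer instead of going through syslogd.  A minimal
sketch, assuming magic SysRq support is built into the kernel (it
normally is in the stock RHEL kernels):

  # enable the magic SysRq key (RHEL's /etc/sysctl.conf usually disables it)
  sysctl -w kernel.sysrq=1

  # trigger the memory report without touching the console keyboard
  echo m > /proc/sysrq-trigger

  # read it back from the ring buffer rather than from syslog
  dmesg > /tmp/sysrq-m.txt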

all_unreclaimable is a flag which, when set, tells the virtual memory
daemons not to bother scanning pages in that zone when trying to free
memory.  In any case, the DMA zone is so tiny (16 MB) that it cannot
possibly have any noticeable effect on a 32 GB machine.
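
Incidentally, the low-memory headroom and the per-zone fragmentation
can be re-checked at any time without SysRq; a rough sketch, assuming
the usual procfs files on a 2.6.9 hugemem kernel:

  # LowTotal/LowFree cover the DMA and Normal zones on a highmem kernel
  grep -E 'LowTotal|LowFree|HighTotal|HighFree' /proc/meminfo

  # free pages per zone and per allocation order, the same data SysRq-m prints
  cat /proc/buddyinfo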

By the way, there seems to be plenty of free HighMem memory, so the
problem cannot possibly be due to overcommit.
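
If you want to rule that out explicitly anyway, a quick check (just a
sketch, using the standard 2.6 overcommit knobs):

  # current overcommit policy and ratio
  sysctl vm.overcommit_memory vm.overcommit_ratio

  # how much address space is committed versus what is still free
  grep -E 'Committed_AS|MemFree|SwapFree' /proc/meminfo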

Based on the above, I could suggest two theories to explain what's happening:

1. You have Normal zone starvation.
Try setting vm.lower_zone_protection to something large enough, say 100 MB:
sysctl -w vm.lower_zone_protection=100
If this theory is correct, that setting should make the issue go away.
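
To make the setting survive a reboot, and to watch whether the Normal
zone keeps getting squeezed in the meantime, something along these
lines should do (just a sketch; 100 is the same example value as above):

  # make the setting persistent
  echo 'vm.lower_zone_protection = 100' >> /etc/sysctl.conf

  # watch low memory over time; LowFree repeatedly collapsing towards the
  # zone watermarks would point at Normal zone starvation
  while true; do grep LowFree /proc/meminfo; sleep 60; done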

2. You have a pagecache flushing storm.
A huge number of dirty pages from the IO on large data sets would stall
the system while they are being sync'ed to disk.  This typically occurs
once the pagecache has grown very large.  Mounting the filesystem in
sync mode (mount -o remount,sync /dev/device) would "fix" the issue.
Synchronous IO is painfully slow, but the test would at least tell us
where the problem lies.  If this turns out to be the cause, we could
then look at less painful options for a bearable fix.
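
For the record, the sync test and its rollback would look roughly like
this, and the Dirty/Writeback counters give an idea of how much data is
waiting to be flushed (a sketch; substitute the real device or mount
point for /dev/device):

  # how much dirty data is pending writeback right now
  grep -E 'Dirty|Writeback' /proc/meminfo

  # force synchronous IO on the affected filesystem for the test ...
  mount -o remount,sync /dev/device

  # ... and switch back to normal behaviour once the test is over
  mount -o remount,async /dev/device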

Good luck,

 -Imed

-- 
Imed Chihi - عماد الشيحي
http://perso.hexabyte.tn/ichihi/



