RHEL4 Sun Java Messaging Server deadlock

John Dalbec jpdalbec at ysu.edu
Tue Apr 26 20:52:42 UTC 2011


> Date: Fri, 4 Feb 2011 20:47:27 +0100
> From: Imed Chihi <imed.chihi at gmail.com>
> To: redhat-list at redhat.com
> Subject: Re: RHEL4 Sun Java Messaging Server deadlock (was:
> 	redhat-list Digest, Vol 84, Issue 3)
> Message-ID:
> 	<AANLkTikZOdvX-tTkd0Z133bJTh+ooqWQp50A+eaHKgX+ at mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> all_unreclaimable is a flag which, when set, tells the virtual memory
> daemons not to bother scanning pages in the zone in question in order
> to try to free memory.  Anyway, the DMA zone is so tiny (16MB)
> that it cannot possibly have any effect in a 32GB machine.
>
> By the way, there seems to be plenty of free HighMem memory, so the
> problem cannot possibly be due to overcommit.
>
> Based on the above, I could suggest two theories to explain what's happening:
>
> 1. you have Normal zone starvation
> Try to set vm.lower_zone_protection to something large enough like 100 MB:
> sysctl -w vm.lower_zone_protection=100
> If this theory is correct, then the setting should fix the issue.
>
> 2. you have a pagecache flushing storm
> A large volume of dirty pages from IO on large data sets can stall
> the system while they are synced to disk.  This typically occurs once
> the pagecache has grown to a significant size.  Mounting the
> filesystem in sync mode (mount -oremount,sync /dev/device) would "fix"
> the issue.  Synchronous IO is painfully slow, but the test
> would at least tell where the problem is.  If this turns out to be the
> problem, then we could think of other less annoying options for a
> bearable fix.
>
> Good luck,
>
>   -Imed
>
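
One low-cost way to probe the flushing-storm theory without remounting in sync mode is to watch the Dirty and Writeback counters in /proc/meminfo over time. The following is a minimal sketch, not a tested diagnostic: the helper names are hypothetical, and the sample text is only an illustration built from the dirty:6213 and writeback:1 page counts in the kernel log in this message, converted at an assumed 4kB per page.

```python
# Sketch: extract the Dirty and Writeback counters from /proc/meminfo text.
# SAMPLE is illustrative; on a live system one would pass the contents of
# /proc/meminfo instead.
SAMPLE = """\
Dirty:           24852 kB
Writeback:           4 kB
"""


def parse_meminfo(text):
    """Return {field name: value in kB} for every numeric meminfo line."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def dirty_summary(text):
    """Return only the counters relevant to a writeback stall."""
    info = parse_meminfo(text)
    return {key: info.get(key) for key in ("Dirty", "Writeback")}


print(dirty_summary(SAMPLE))
```

Sampling these values every few seconds and comparing them just before a stall with normal operation would show whether dirty data is really piling up. A few tens of MB of dirty pages, as in the sample, would not by itself suggest a storm.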

It's baack...

/proc/sys/vm/lower_zone_protection:100

I don't think running for two months with synchronous I/O is an option.

Apr 26 11:08:26 myysumail kernel: cpu 23 hot: low 32, high 96, batch 16
Apr 26 11:08:26 myysumail kernel: cpu 23 cold: low 0, high 32, batch 16
Apr 26 11:08:26 myysumail kernel:
Apr 26 11:08:26 myysumail kernel: Free pages:    20938624kB (20896192kB HighMem)
Apr 26 11:08:26 myysumail kernel: Active:1158012 inactive:1741131 dirty:6213 writeback:1 unstable:0 free:5234656 slab:162944 mapped:345466 pagetables:6398
Apr 26 11:08:26 myysumail kernel: DMA free:12528kB min:32kB low:64kB high:96kB active:0kB inactive:0kB present:16384kB pages_scanned:0 all_unreclaimable? yes
Apr 26 11:08:26 myysumail kernel: protections[]: 0 398800 424400
Apr 26 11:08:26 myysumail kernel: Normal free:29904kB min:7976kB low:15952kB high:23928kB active:520188kB inactive:565376kB present:4014080kB pages_scanned:0 all_unreclaimable? no
Apr 26 11:08:26 myysumail kernel: protections[]: 0 0 25600
Apr 26 11:08:26 myysumail kernel: HighMem free:20896192kB min:512kB low:1024kB high:1536kB active:4111860kB inactive:6399148kB present:31621120kB pages_scanned:0 all_unreclaimable? no
Apr 26 11:08:26 myysumail kernel: protections[]: 0 0 0
Apr 26 11:08:26 myysumail kernel: DMA: 4*4kB 6*8kB 3*16kB 2*32kB 3*64kB 3*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 12528kB
Apr 26 11:08:26 myysumail kernel: Normal: 3830*4kB 717*8kB 85*16kB 8*32kB 11*64kB 7*128kB 6*256kB 4*512kB 0*1024kB 1*2048kB 0*4096kB = 29904kB
Apr 26 11:08:26 myysumail kernel: HighMem: 2904*4kB 1080*8kB 482*16kB 76*32kB 23694*64kB 7005*128kB 9663*256kB 3151*512kB 705*1024kB 86*2048kB 3288*4096kB = 20896192kB
Apr 26 11:08:26 myysumail kernel: 2632760 pagecache pages
Apr 26 11:08:26 myysumail kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Apr 26 11:08:26 myysumail kernel: 0 bounce buffer pages
Apr 26 11:08:26 myysumail kernel: Free swap:       16777208kB
Apr 26 11:08:26 myysumail kernel: 8912896 pages of RAM
Apr 26 11:08:26 myysumail kernel: 7864320 pages of HIGHMEM
Apr 26 11:08:26 myysumail kernel: 597583 reserved pages
Apr 26 11:08:26 myysumail kernel: 1548546 pages shared
Apr 26 11:08:26 myysumail kernel: 0 pages swap cached
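
To check the Normal zone starvation theory against this dump, the per-zone free figures can be compared with the min/low/high watermarks. The sketch below is only an illustration: the function names are hypothetical, the three zone lines are copied from the log above with the syslog prefix dropped and each entry on one line, and the comparison deliberately ignores the protections[] values.

```python
import re

# Zone summary lines taken from the log above (syslog prefix removed).
LOG = """\
DMA free:12528kB min:32kB low:64kB high:96kB active:0kB inactive:0kB present:16384kB pages_scanned:0 all_unreclaimable? yes
Normal free:29904kB min:7976kB low:15952kB high:23928kB active:520188kB inactive:565376kB present:4014080kB pages_scanned:0 all_unreclaimable? no
HighMem free:20896192kB min:512kB low:1024kB high:1536kB active:4111860kB inactive:6399148kB present:31621120kB pages_scanned:0 all_unreclaimable? no
"""

ZONE_RE = re.compile(
    r"^(?P<zone>\w+) free:(?P<free>\d+)kB min:(?P<min>\d+)kB "
    r"low:(?P<low>\d+)kB high:(?P<high>\d+)kB",
    re.MULTILINE,
)


def zone_watermarks(text):
    """Return {zone: {free, min, low, high}} with all values in kB."""
    zones = {}
    for match in ZONE_RE.finditer(text):
        zones[match.group("zone")] = {
            key: int(match.group(key)) for key in ("free", "min", "low", "high")
        }
    return zones


def status(zone):
    """Classify free memory relative to the zone's own watermarks."""
    if zone["free"] < zone["min"]:
        return "below min"
    if zone["free"] < zone["low"]:
        return "below low"
    if zone["free"] < zone["high"]:
        return "below high"
    return "above high"


for name, zone in zone_watermarks(LOG).items():
    print(f"{name:8s} free={zone['free']}kB high={zone['high']}kB {status(zone)}")
```

By this simple measure every zone, including Normal, is above its high watermark at the time of the dump, which does not rule out starvation at other moments.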

  205   205 TS       -   0  24  22  0.0 D    start_this_handle    pdflush
 5021  5021 TS       -   0  24   8  0.0 D    journal_commit_trans kjournald
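
A listing like the one above can be filtered for every thread in uninterruptible sleep (state D), which would also surface any user threads stuck on the same lock. This is a rough sketch under assumptions: the column layout is guessed to match something like `ps -eLo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:20,comm` (the command actually used is not shown here), the function name is hypothetical, and the third sample row is invented purely to show a non-blocked task being skipped.

```python
# Sketch: pick out D-state threads from ps output with the assumed columns
# pid, tid, class, rtprio, ni, pri, psr, pcpu, stat, wchan, comm.
SAMPLE = """\
  205   205 TS       -   0  24  22  0.0 D    start_this_handle    pdflush
 5021  5021 TS       -   0  24   8  0.0 D    journal_commit_trans kjournald
 6100  6100 TS       -   0  24   3  0.0 S    -                    example
"""  # the last row is hypothetical


def blocked_tasks(ps_text):
    """Return (pid, tid, wchan, command) for each thread whose state starts with D."""
    rows = []
    for line in ps_text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip blank or truncated lines
        stat, wchan = fields[8], fields[9]
        if stat.startswith("D"):
            command = " ".join(fields[10:])
            rows.append((int(fields[0]), int(fields[1]), wchan, command))
    return rows


for pid, tid, wchan, command in blocked_tasks(SAMPLE):
    print(pid, tid, wchan, command)
```

Running the same filter on a full thread listing taken during a hang would give the PIDs to feed to the crash backtrace command for a closer look.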

PID: 205    TASK: 81515830  CPU: 22  COMMAND: "pdflush"
#0 [814f3d44] rwsem_down_read_failed at 22d49de
#1 [814f3d98] add_wait_queue_exclusive at 2120dbb
#2 [814f3e5c] dio_bio_end_io at 217b638
#3 [814f3ed0] __pdflush at 21461be
#4 [814f3ee4] sync_sb_inodes at 2179f67
#5 [814f3f28] mpage_end_io_read at 217a485
#6 [814f3f38] dirty_writeback_centisecs_handler at 2145ab1
#7 [814f3f40] do_IRQ at 2107e0a
#8 [814f3f9c] __pdflush at 21462b5
#9 [814f3fd0] kthread_create at 2134227
#10 [814f3ff0] kernel_thread_helper at 21041f3

PID: 5021   TASK: 7fd2adf0  CPU: 8   COMMAND: "kjournald"
#0 [7e85fd84] rwsem_down_read_failed at 22d49de
#1 [7e85fd90] finish_wait at 2120ea5
#2 [7e85fda0] scheduler_tick at 211f273
#3 [7e85fdd8] add_wait_queue_exclusive at 2120dbb
#4 [7e85fe98] find_busiest_group at 211e8f2
#5 [7e85ff04] rwsem_down_read_failed at 22d4a09
#6 [7e85ff0c] finish_wait at 2120ea5
#7 [7e85ff1c] scheduler_tick at 211f273
#8 [7e85ff54] del_timer_sync at 212a271
#9 [7e85ffa4] schedule_tail at 211e12c
#10 [7e85fff0] kernel_thread_helper at 21041f3

If one of the user threads is involved, how can I identify it?
Thanks,
John




