
Re: RHEL4 Sun Java Messaging Server deadlock



Date: Fri, 4 Feb 2011 20:47:27 +0100
From: Imed Chihi <imed chihi gmail com>
To: redhat-list redhat com
Subject: Re: RHEL4 Sun Java Messaging Server deadlock (was: redhat-list Digest, Vol 84, Issue 3)
Message-ID:
	<AANLkTikZOdvX-tTkd0Z133bJTh+ooqWQp50A+eaHKgX+ mail gmail com>
Content-Type: text/plain; charset=UTF-8

all_unreclaimable is a flag which, when set, tells the virtual memory
daemons not to bother scanning pages in the zone in question in order
to try to free memory.  Anyway, the DMA zone is so tiny (16MB) that it
cannot possibly have any effect on a 32GB machine.

By the way, there seems to be plenty of free HighMem memory, so the
problem cannot possibly be due to overcommit.

Based on the above, I can suggest two theories to explain what's happening:

1. You have a Normal zone starvation.
Try setting vm.lower_zone_protection to something large enough, like 100 MB:
sysctl -w vm.lower_zone_protection=100
If this theory is correct, then the setting should fix the issue.

2. You have a pagecache flushing storm.
A large volume of dirty pages from the IO of large data sets would stall
the system while being synced to disk.  This typically occurs once
the pagecache has grown to a significant size.  Mounting the
filesystem in sync mode (mount -o remount,sync /dev/device) would "fix"
the issue.  Synchronous IO is painfully slow, but the test would at
least tell where the problem is.  If this turns out to be the
problem, then we could think of other less annoying options for a
bearable fix.
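
A cheaper first check for the second theory is to watch the Dirty and
Writeback counters in /proc/meminfo while a stall builds up.  Below is a
minimal sketch in Python, assuming a reasonably recent interpreter is
available (the stock one on RHEL4 is old, so the parsing can also be run
on a copy of the file taken elsewhere).  The function names are mine and
not part of any existing tool.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_summary(text):
    """Return (dirty_kb, writeback_kb) from meminfo-formatted text."""
    info = parse_meminfo(text)
    return info.get("Dirty", 0), info.get("Writeback", 0)


def report(path="/proc/meminfo"):
    """Read the live file and print one summary line (call it periodically)."""
    with open(path) as f:
        dirty_kb, writeback_kb = dirty_summary(f.read())
    print("Dirty: %d kB  Writeback: %d kB" % (dirty_kb, writeback_kb))
```

Calling report() every few seconds around the time of a hang would show
whether Dirty climbs to a large value and then drains through Writeback,
which is what a flushing storm should look like.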

Good luck,

  -Imed


It's baack...

/proc/sys/vm/lower_zone_protection:100

I don't think running for two months with synchronous I/O is an option.

Apr 26 11:08:26 myysumail kernel: cpu 23 hot: low 32, high 96, batch 16
Apr 26 11:08:26 myysumail kernel: cpu 23 cold: low 0, high 32, batch 16
Apr 26 11:08:26 myysumail kernel:
Apr 26 11:08:26 myysumail kernel: Free pages: 20938624kB (20896192kB HighMem)
Apr 26 11:08:26 myysumail kernel: Active:1158012 inactive:1741131 dirty:6213 writeback:1 unstable:0 free:5234656 slab:162944 mapped:345466 pagetables:6398
Apr 26 11:08:26 myysumail kernel: DMA free:12528kB min:32kB low:64kB high:96kB active:0kB inactive:0kB present:16384kB pages_scanned:0 all_unreclaimable? yes
Apr 26 11:08:26 myysumail kernel: protections[]: 0 398800 424400
Apr 26 11:08:26 myysumail kernel: Normal free:29904kB min:7976kB low:15952kB high:23928kB active:520188kB inactive:565376kB present:4014080kB pages_scanned:0 all_unreclaimable? no
Apr 26 11:08:26 myysumail kernel: protections[]: 0 0 25600
Apr 26 11:08:26 myysumail kernel: HighMem free:20896192kB min:512kB low:1024kB high:1536kB active:4111860kB inactive:6399148kB present:31621120kB pages_scanned:0 all_unreclaimable? no
Apr 26 11:08:26 myysumail kernel: protections[]: 0 0 0
Apr 26 11:08:26 myysumail kernel: DMA: 4*4kB 6*8kB 3*16kB 2*32kB 3*64kB 3*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 12528kB
Apr 26 11:08:26 myysumail kernel: Normal: 3830*4kB 717*8kB 85*16kB 8*32kB 11*64kB 7*128kB 6*256kB 4*512kB 0*1024kB 1*2048kB 0*4096kB = 29904kB
Apr 26 11:08:26 myysumail kernel: HighMem: 2904*4kB 1080*8kB 482*16kB 76*32kB 23694*64kB 7005*128kB 9663*256kB 3151*512kB 705*1024kB 86*2048kB 3288*4096kB = 20896192kB
Apr 26 11:08:26 myysumail kernel: 2632760 pagecache pages
Apr 26 11:08:26 myysumail kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Apr 26 11:08:26 myysumail kernel: 0 bounce buffer pages
Apr 26 11:08:26 myysumail kernel: Free swap:       16777208kB
Apr 26 11:08:26 myysumail kernel: 8912896 pages of RAM
Apr 26 11:08:26 myysumail kernel: 7864320 pages of HIGHMEM
Apr 26 11:08:26 myysumail kernel: 597583 reserved pages
Apr 26 11:08:26 myysumail kernel: 1548546 pages shared
Apr 26 11:08:26 myysumail kernel: 0 pages swap cached
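
For reading dumps like this one, a small parser makes it easier to
compare each zone's free memory against its min/low/high watermarks.
This is only a rough sketch: it assumes each per-zone line has been
rejoined onto a single line (the mail client wrapped them), and the
names ZONE_RE and zone_status are made up for the example.

```python
import re

# Matches the per-zone summary lines, e.g. "Normal free:29904kB min:7976kB ..."
ZONE_RE = re.compile(
    r"(?P<zone>DMA|Normal|HighMem) free:(?P<free>\d+)kB "
    r"min:(?P<min>\d+)kB low:(?P<low>\d+)kB high:(?P<high>\d+)kB"
)

# The Normal zone line from the dump, unwrapped.
NORMAL_LINE = (
    "Apr 26 11:08:26 myysumail kernel: Normal free:29904kB min:7976kB "
    "low:15952kB high:23928kB active:520188kB inactive:565376kB "
    "present:4014080kB pages_scanned:0 all_unreclaimable? no"
)


def zone_status(line):
    """Return (zone, free_kb, status) for a per-zone line, or None if no match."""
    m = ZONE_RE.search(line)
    if m is None:
        return None
    free, wm_min, wm_low, wm_high = (
        int(m.group(k)) for k in ("free", "min", "low", "high")
    )
    if free < wm_min:
        status = "below min"
    elif free < wm_low:
        status = "below low"
    elif free < wm_high:
        status = "below high"
    else:
        status = "above high"
    return m.group("zone"), free, status


print(zone_status(NORMAL_LINE))  # ('Normal', 29904, 'above high')
```

At the moment of this dump the Normal zone sits above its high
watermark, although only by about 6MB.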

 205   205 TS       -   0  24  22  0.0 D    start_this_handle    pdflush
5021  5021 TS       -   0  24   8  0.0 D    journal_commit_trans kjournald
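
The two lines above have the shape of ps output with per-thread state
and wait-channel columns.  The exact command is not shown; something
like "ps -eLo pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:20,comm"
is an assumption.  Under that assumed column layout, a minimal sketch
for pulling out every task in uninterruptible sleep (state D), user
threads included, might look like this (blocked_tasks is a made-up
helper name):

```python
def blocked_tasks(ps_output, stat_col=8):
    """Return (pid, tid, wchan, comm) for each line whose STAT starts with 'D'.

    stat_col is the zero-based index of the STAT column; 8 matches the
    assumed column order pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan,comm.
    """
    blocked = []
    for line in ps_output.splitlines():
        cols = line.split()
        if len(cols) < stat_col + 3:
            continue  # too short to hold stat, wchan and comm
        if cols[stat_col].startswith("D"):
            comm = " ".join(cols[stat_col + 2:])
            blocked.append((cols[0], cols[1], cols[stat_col + 1], comm))
    return blocked
```

Feeding it the full ps listing taken during a hang would list any Java
threads stuck in D state next to pdflush and kjournald, along with the
kernel function each one is waiting in.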

PID: 205    TASK: 81515830  CPU: 22  COMMAND: "pdflush"
#0 [814f3d44] rwsem_down_read_failed at 22d49de
#1 [814f3d98] add_wait_queue_exclusive at 2120dbb
#2 [814f3e5c] dio_bio_end_io at 217b638
#3 [814f3ed0] __pdflush at 21461be
#4 [814f3ee4] sync_sb_inodes at 2179f67
#5 [814f3f28] mpage_end_io_read at 217a485
#6 [814f3f38] dirty_writeback_centisecs_handler at 2145ab1
#7 [814f3f40] do_IRQ at 2107e0a
#8 [814f3f9c] __pdflush at 21462b5
#9 [814f3fd0] kthread_create at 2134227
#10 [814f3ff0] kernel_thread_helper at 21041f3

PID: 5021   TASK: 7fd2adf0  CPU: 8   COMMAND: "kjournald"
#0 [7e85fd84] rwsem_down_read_failed at 22d49de
#1 [7e85fd90] finish_wait at 2120ea5
#2 [7e85fda0] scheduler_tick at 211f273
#3 [7e85fdd8] add_wait_queue_exclusive at 2120dbb
#4 [7e85fe98] find_busiest_group at 211e8f2
#5 [7e85ff04] rwsem_down_read_failed at 22d4a09
#6 [7e85ff0c] finish_wait at 2120ea5
#7 [7e85ff1c] scheduler_tick at 211f273
#8 [7e85ff54] del_timer_sync at 212a271
#9 [7e85ffa4] schedule_tail at 211e12c
#10 [7e85fff0] kernel_thread_helper at 21041f3

If one of the user threads is involved, how can I identify it?
Thanks,
John



