[Linux-cluster] GFS + CORAID Performance Problem

Wendy Cheng wcheng at redhat.com
Mon Dec 11 06:16:54 UTC 2006


bigendian+gfs at gmail.com wrote:

> Thanks Wendy!  I don't know if my second post made it since I sent it 
> from the wrong alias, but here's a little more information I've gathered:

Great job! Glad to have such an educated user who even supplied sysrq-t 
output without being asked ... It looks like (vm) flushing is part of the 
culprit. Note that since you have such a huge amount of memory, the vm is 
slow to reclaim pages (it thinks you have plenty of free pages). Then all 
of a sudden, when it decides to flush, it goes into a "burst mode" that 
creates this latency issue. One way to get around this is to ask the 
flush daemon to flush more (and more often) by setting the vm dirty 
ratios to lower values. The trick is to flush the dirty pages in a more 
uniform manner, instead of letting the system work itself into burst 
mode. I'm not sure which kernel you're running. For RHEL4 (and 2.6-based 
community kernels), there are a few tunables in the /proc/sys/vm 
directory you can play around with. The two most useful ones are:

shell> echo 50 > /proc/sys/vm/dirty_background_ratio
shell> echo 80 > /proc/sys/vm/dirty_ratio

The 80 and 50 are two numbers I made up just to show the usage. You have 
to play around with them (but make sure to remember their original 
default values). With this, there is no need to set mem=<smaller value> 
in your kernel parameters. I was just too lazy to explain that (and/or 
write instructions about sysrq-t) in my previous mail.
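
In case it helps, something like the following lets you note the 
originals first and put them back when you're done (the angle-bracket 
values are just placeholders for whatever the first two commands report):

shell> cat /proc/sys/vm/dirty_background_ratio
shell> cat /proc/sys/vm/dirty_ratio
shell> echo <original_value> > /proc/sys/vm/dirty_background_ratio
shell> echo <original_value> > /proc/sys/vm/dirty_ratio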

If you cannot get improvements from this tuning, your next homework is 
to give us oprofile output ... :) ...
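
In case it comes to that, here is a rough sketch of one way to collect a 
system-wide profile with oprofile (the --vmlinux path is only a guess 
based on the kernel-debuginfo package; point it at your uncompressed 
kernel image, or use --no-vmlinux if you don't have one):

shell> opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/`uname -r`/vmlinux
shell> opcontrol --start
shell> # run bonnie++ until the read stalls show up
shell> opcontrol --stop
shell> opreport --symbols > oprofile-report.txt
shell> opcontrol --shutdown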

-- Wendy

>
> * The I/O blocking behavior seems to be isolated to a single node in 
> my two-node cluster.  The other node shows no symptoms even when 
> running bonnie++ at the same time in a different directory.
> * Shutting down the cluster on the "good" machine doesn't improve 
> things on the other.
> * Shutting down the cluster on the "bad" machine doesn't hurt 
> performance on the other.
> * It appears that only read system calls are blocking during the 
> "Rewriting" portion of the bonnie++ suite.
> * Stopping bonnie++ on the node with blocking issues results in 15+ 
> seconds of high I/O activity where pdflush and gfs_logd (and 
> gfs_inoded IIRC) are quite busy.
> * The read delay, which seems to happen every few seconds, is 
> sometimes as long as 10 seconds.
> * Rebooting for good measure didn't make a difference.
>
> I don't know if this is helpful, but here's some data I captured from 
> an "echo t > /proc/sysrq-trigger":
>
> Dec 10 17:31:25 gfs03 kernel: pdflush       D ffffffff8014b24f     0   
> 122     18           123    86 (L-TLB)
> Dec 10 17:31:25 gfs03 kernel: 0000010629ebbc78 0000000000000046 
> 0000010629fe69c0 0000010227bc3f00
> Dec 10 17:31:25 gfs03 kernel:        0000000000000216 ffffffffa00e798a 
> 0000010829d660c0 0000000200000008
> Dec 10 17:31:25 gfs03 kernel:        0000010629e7b7f0 000000000000120d
> Dec 10 17:31:25 gfs03 kernel: Call 
> Trace:<ffffffffa00e798a>{:dm_mod:dm_request+396} 
> <ffffffffa0229db8>{:gfs:diaper_make_request+162}
> Dec 10 17:31:25 gfs03 kernel:        
> <ffffffff8014b24f>{keventd_create_kthread+0} 
> <ffffffff8030a11f>{io_schedule+38}
> Dec 10 17:31:25 gfs03 kernel:        
> <ffffffffa00e79c5>{:dm_mod:dm_unplug_all+0} 
> <ffffffff80179d24>{__wait_on_buffer+125}
> Dec 10 17:31:25 gfs03 kernel:        
> <ffffffff80179baa>{bh_wake_function+0} 
> <ffffffff80179baa>{bh_wake_function+0}
> Dec 10 17:31:25 gfs03 kernel:        
> <ffffffffa022bc67>{:gfs:gfs_logbh_wait+49} 
> <ffffffffa024099a>{:gfs:disk_commit+794}
> Dec 10 17:31:25 gfs03 kernel:        
> <ffffffffa0240b6b>{:gfs:log_refund+111} 
> <ffffffffa0241082>{:gfs:log_flush_internal+510}
> Dec 10 17:31:25 gfs03 kernel:        
> <ffffffff8017e756>{sync_supers+167} <ffffffff8015f20a>{wb_kupdate+36}
> Dec 10 17:31:25 gfs03 kernel:        <ffffffff8015fcb0>{pdflush+323} 
> <ffffffff8015f1e6>{wb_kupdate+0}
> Dec 10 17:31:25 gfs03 kernel:        <ffffffff8015fb6d>{pdflush+0} 
> <ffffffff8014b226>{kthread+199}
> Dec 10 17:31:25 gfs03 kernel:        <ffffffff80110f47>{child_rip+8} 
> <ffffffff8014b24f>{keventd_create_kthread+0}
> Dec 10 17:31:25 gfs03 kernel:        <ffffffff8014b15f>{kthread+0} 
> <ffffffff80110f3f>{child_rip+0}
>
>
> I'll try again with 4GB memory to see if that changes anything.  I'm 
> mostly puzzled by how this is happening only on one node.
>
> Thanks again for the help! 
> Tom
>
>
>
> On 12/10/06, *Wendy Cheng* <wcheng at redhat.com> wrote:
>
>     Wendy Cheng wrote:
>
>     > Wendy Cheng wrote:
>     >
>     >> bigendian+gfs at gmail.com wrote:
>     >>
>     >>> I've just set up a new two-node GFS cluster on a CORAID sr1520
>     >>> ATA-over-Ethernet.  My nodes are each quad dual-core Opteron CPU
>     >>> systems with 32GB RAM each.  The CORAID unit exports a 1.6TB block
>     >>> device that I have a GFS file system on.
>     >>>
>     >>> I seem to be having performance issues where certain read system
>     >>> calls take up to three seconds to complete.  My test app is
>     >>> bonnie++, and the slow-downs appear to happen in the "Rewriting"
>     >>> portion of the test, though I'm not sure if this is exclusive.  If I
>     >>> watch top and iostat for the device in question, I see activity on
>     >>> the device, then long (up to three-second) periods of no apparent
>     >>> I/O.  During the periods of no I/O the bonnie++ process is blocked
>     >>> on disk I/O, so it seems that the system is trying to do something.
>     >>> Network traces seem to show that the host machine is not waiting on
>     >>> the RAID array, and the packet following the dead period seems to
>     >>> always be sent from the host to the CORAID device.  Unfortunately, I
>     >>> don't know how to dig in any deeper to figure out what the problem is.
>     >>
>     >>
>     > Wait ... sorry, I didn't read carefully ... now I see the 3-second
>     > delays in the strace. That doesn't look like a bonnie++ issue ...
>     > Does bonnie++ run on a single node? Or do you dispatch it on both
>     > nodes (in different directories)? This is more complicated than I
>     > originally expected (since this is a network block device?). Need to
>     > think about how to catch the culprit ... it could be a memory issue,
>     > though. Could you try to run bonnie++ with 4G of memory to see
>     > whether you still see the 3-second read delays?
>     >
>     Hit the send key too soon ... my words are cluttered. Note that
>     reducing memory from 32G to 4G may sound funny, but there are VM
>     issues behind this, so it is a quick and dirty experiment.
>
>     -- Wendy
>
>     --
>     Linux-cluster mailing list
>     Linux-cluster at redhat.com
>     https://www.redhat.com/mailman/listinfo/linux-cluster
>
>



