[Linux-cluster] GFS + CORAID Performance Problem
Jayson Vantuyl
jvantuyl at engineyard.com
Mon Dec 11 00:01:23 UTC 2006
Tom,
I currently administer a system running a similar but larger setup,
so I may be able to help you.
First, make sure you contact Coraid. They are really good about
helping with this stuff.
Second, have you looked at /dev/etherd/err? There is usually a lot
of good debugging output there.
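
A minimal sketch of capturing that output while the benchmark runs,
assuming the aoe driver is loaded and you have permission to read the
device. The function name and the max_lines limit are hypothetical
choices for illustration, and as I understand it a read on the real
device blocks until the driver has something to report:

```python
import itertools

# Path named above; it only exists when the aoe driver is loaded.
ERR_DEVICE = "/dev/etherd/err"


def read_aoe_errors(path=ERR_DEVICE, max_lines=20):
    """Return up to max_lines diagnostic lines read from path.

    On the real device this is expected to wait for new messages,
    so run it while the slow workload is active.
    """
    with open(path, "r", errors="replace") as dev:
        # islice stops after max_lines so the call returns eventually.
        return [line.rstrip("\n") for line in itertools.islice(dev, max_lines)]
```

Calling read_aoe_errors() in one terminal while bonnie++ runs in
another should show whether the driver logs anything during the
stalls.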
Third, have you upgraded the firmware in the Coraid and built the
newest AoE driver? Both are absolutely critical for getting the best
performance and reliability, and the driver in the stock kernel has
generally fallen behind. Coraid assures me they're working on this,
and I can vouch that their driver is essentially the in-kernel one
plus the development needed to make it work, not some sort of
vendor-supplied out-of-tree driver.
Finally, make sure you have good switches. I have had a number of
switches that drop a packet here and there. These are death to AoE
performance. Gigabit is generally a must as well.
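
One rough way to look for loss from the host side is to compare the
interface counters under /sys/class/net before and after a test run.
This is only a sketch: the helper names are made up for illustration,
and these counters cover the host NIC only, so a switch that drops
frames internally may not show up here at all.

```python
import os

# Standard per-interface counters exposed by Linux sysfs.
COUNTERS = ("rx_dropped", "tx_dropped", "rx_errors", "tx_errors")


def read_counters(iface, root="/sys/class/net"):
    """Read the drop and error counters for one network interface."""
    stats_dir = os.path.join(root, iface, "statistics")
    result = {}
    for name in COUNTERS:
        with open(os.path.join(stats_dir, name)) as f:
            result[name] = int(f.read().strip())
    return result


def counter_deltas(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before[name] for name in before}
```

Take one snapshot with read_counters() for the interface that carries
the AoE traffic, run the benchmark, take a second snapshot, and pass
both to counter_deltas(). Any nonzero delta is worth chasing.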
On Dec 10, 2006, at 2:03 AM, bigendian+gfs at gmail.com wrote:
> I've just set up a new two-node GFS cluster on a CORAID sr1520
> ATA-over-Ethernet. My nodes are each quad dual-core Opteron CPU
> systems with 32GB RAM each. The CORAID unit exports a 1.6TB block
> device that I have a GFS file system on.
>
> I seem to be having performance issues where certain read system
> calls take up to three seconds to complete. My test app is
> bonnie++, and the slow-downs appear to happen in the "Rewriting"
> portion of the test, though I'm not sure if this is exclusive. If
> I watch top and iostat for the device in question, I see activity
> on the device, then long (up to three second) periods of no
> apparent I/O. During the periods of no I/O the bonnie++ process is
> blocked on disk I/O, so it seems that the system is trying to do
> something. Network traces seem to show that the host machine is
> not waiting on the RAID array, and the packet following the
> dead period seems to always be sent from the host to the coraid
> device. Unfortunately, I don't know how to dig in any deeper to
> figure out what the problem is.
>
> Below are strace and tcpdump snippets that show what I'm talking
> about. Notice the time stamps and the time spent in system calls
> in <> brackets after the call. I'm quite far from a GFS expert, so
> please let me know if other data would be helpful.
>
> Any help is much appreciated.
>
> Thanks!
--
Jayson Vantuyl
Systems Architect
Engine Yard
jvantuyl at engineyard.com