[Linux-cluster] GFS + CORAID Performance Problem

Mon Dec 11 00:01:23 UTC 2006

Tom,

I currently administer a system running a similar but larger setup,  
so I may be able to help you.

First, make sure you contact Coraid.  They are really good about  
helping with this stuff.

Second, have you looked at /dev/etherd/err?  There is usually a lot
of good debugging information there.

Third, have you upgraded the firmware in the Coraid and built the
newest AoE driver?  Both are absolutely critical for getting the best
performance and reliability, and the stock in-kernel driver has
generally fallen behind.  Coraid assures me they're working on this,
and I can vouch for the fact that their driver is essentially the one
in the kernel plus the development necessary to make it work, not
some sort of unrelated vendor-supplied out-of-tree driver.
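
To confirm which AoE driver is actually loaded after a rebuild, something
like the following may help.  This is only a rough sketch: it assumes the
loaded aoe module exposes its version through the usual sysfs module
attribute (/sys/module/aoe/version), and the helper name
read_module_version is made up for illustration.

```python
from pathlib import Path

# Assumption: the loaded aoe module publishes its version here via sysfs.
AOE_VERSION_PATH = Path("/sys/module/aoe/version")


def read_module_version(path=AOE_VERSION_PATH):
    """Return the module version string, or None if the file is not readable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        # Module not loaded, or it exposes no version attribute.
        return None


if __name__ == "__main__":
    version = read_module_version()
    if version is None:
        print("aoe module not loaded, or no version attribute found")
    else:
        print(f"loaded aoe driver version: {version}")
```

If this still reports the old version after building the new driver, the
old module is probably still the one being loaded.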

Finally, make sure you have good switches.  I have had a number of  
switches that drop a packet here and there.  These are death to AoE  
performance.  Gigabit is generally a must as well.
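
One rough way to look for loss from the host side is to snapshot the
error and drop counters in /proc/net/dev before and after a test run and
compare them.  The sketch below does that; the function names and the
10-second window are my own choices for illustration.  Note that it only
sees what the host's own NICs count, so drops inside the switch would
still have to be read from the switch itself.

```python
import time

# Column positions after the "iface:" prefix in /proc/net/dev
# (8 receive fields followed by 8 transmit fields).
WATCHED = {"rx_errs": 2, "rx_drop": 3, "tx_errs": 10, "tx_drop": 11}


def parse_net_dev(text):
    """Parse /proc/net/dev content into {iface: {counter: value}}."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        counters[name.strip()] = {k: int(fields[i]) for k, i in WATCHED.items()}
    return counters


def counter_deltas(before, after):
    """Return only the watched counters that changed between two snapshots."""
    deltas = {}
    for iface, now in after.items():
        then = before.get(iface)
        if then is None:
            continue  # interface appeared mid-run, nothing to compare
        changed = {k: now[k] - then[k] for k in now if now[k] != then[k]}
        if changed:
            deltas[iface] = changed
    return deltas


def read_net_dev(path="/proc/net/dev"):
    with open(path) as f:
        return parse_net_dev(f.read())


if __name__ == "__main__":
    before = read_net_dev()
    time.sleep(10)  # run the I/O test during this window
    after = read_net_dev()
    print(counter_deltas(before, after) or "no new errors or drops seen")
```

Any non-zero delta on the interface carrying AoE traffic during a
bonnie++ run would be worth chasing.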

On Dec 10, 2006, at 2:03 AM, bigendian+gfs at gmail.com wrote:

> I've just set up a new two-node GFS cluster on a CORAID sr1520
> ATA-over-Ethernet.  My nodes are each quad dual-core Opteron CPU
> systems with 32GB RAM each.  The CORAID unit exports a 1.6TB block
> device that I have a GFS file system on.
>
> I seem to be having performance issues where certain read system
> calls take up to three seconds to complete.  My test app is
> bonnie++, and the slow-downs appear to happen in the "Rewriting"
> portion of the test, though I'm not sure if this is exclusive.  If
> I watch top and iostat for the device in question, I see activity
> on the device, then long (up to three second) periods of no
> apparent I/O.  During the periods of no I/O the bonnie++ process is
> blocked on disk I/O, so it seems that the system is trying to do
> something.  Network traces seem to show that the host machine is
> not waiting on the RAID array, and the packet following the dead
> period seems to always be sent from the host to the coraid device.
> Unfortunately, I don't know how to dig in any deeper to figure out
> what the problem is.
>
> Below are strace and tcpdump snippets that show what I'm talking  
> about.  Notice the time stamps and the time spent in system calls  
> in <> brackets after the call.  I'm quite far from a GFS expert, so  
> please let me know if other data would be helpful.
>
> Any help is much appreciated.
>
> Thanks!

-- 
Jayson Vantuyl
Systems Architect
Engine Yard
jvantuyl at engineyard.com
