[Linux-cluster] GFS (1 & partially 2) performance problems

Jankowski, Chris Chris.Jankowski at hp.com
Mon Jun 14 15:09:30 UTC 2010


Michael,

For comparison, could you do your dd(1) tests with a very large block size (1 MB) and tell us the results, please?
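For example, something along these lines on each node (the path below is just a placeholder for your GFS mount):

    dd if=/dev/zero of=/mnt/gfs/nodeN/bigfile bs=1M count=160000   # ~160 GB sequential write
    dd if=/mnt/gfs/nodeN/bigfile of=/dev/null bs=1M                # sequential read-back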

I have a vague hunch that the problem may have something to do with whether or not the IO operations are being coalesced.

Also, which IO scheduler are you using?
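You can check (and change) it per block device through sysfs; the device name below is only an example:

    cat /sys/block/sdb/queue/scheduler              # active scheduler is shown in [brackets]
    echo deadline > /sys/block/sdb/queue/scheduler  # switch scheduler at runtime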

Thanks and regards,

Chris Jankowski


-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Michael Lackner
Sent: Tuesday, 15 June 2010 00:22
To: linux clustering
Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems

Hello!

Thanks for your reply. Unfortunately I forgot to mention HOW I was actually testing, stupid of me.

I tested with dd, doing 4 kB blocksize reads and writes, with a 160 GB total testfile size per node.
I read from /dev/zero for the write tests and wrote to /dev/null for the read tests. So the access pattern was totally sequential, with a somewhat small blocksize (equal to the filesystem block size).
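Concretely, each node ran something equivalent to the following (the exact paths differ on my setup):

    dd if=/dev/zero of=/mnt/gfs/nodeN/testfile bs=4k count=39062500   # 160 GB sequential write
    dd if=/mnt/gfs/nodeN/testfile of=/dev/null bs=4k                  # sequential read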

The performance was measured directly on the Fibrechannel Switch, which offers nice per-port monitoring for that purpose.

I have yet to do some serious read testing on GFS2; I aborted my GFS2 tests because write performance was not up to GFS1 to begin with. My older GFS2 benchmarks (I did these with a 2-node configuration before) are lost, so I will need to re-do them to give you some numbers.

After each write test I did a "sync" to flush everything to disk. I did not do this before or after the read tests, though.
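If it matters, I suppose I could rule out page cache effects before each read test like this (as root; drop_caches exists on 2.6.16+ kernels, so CentOS 5 should have it):

    sync                                 # flush dirty data first
    echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes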

Since you mentioned journal size: "gfs_tool counters <mountpoint>" reported that only 2-3% of log space was in use after the tests (I guess this is the per-node filesystem journal?).

As for the direct I/O tests, by that you mean testing without ANY caching going on, i.e. a synchronous write? What I did before was test EXT3 (~190 MB/s) and XFS (~320 MB/s) on the storage array. I think what I'm getting here is raw throughput, since I am not monitoring in the OS, but at the Fibrechannel switch itself.

I will do GFS2 read tests similar to those conducted for GFS1. I'll be able to do that tomorrow morning, and then I can post the numbers here.

Thanks!

Steven Whitehouse wrote:
> Hi,
>
> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>   
>> Hello!
>>
>> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
>>
>> At the moment, the storage subsystem consists of an HP MSA2312 
>> Fibrechannel SAN linked to an FC 8gbit switch. Three client machines 
>> are connected to that switch over 8gbit FC. The disks themselves are
>> 12 * 15.000rpm SAS configured in RAID-5 with two hotspares.
>>
>> Now, the whole storage shall be shared (single filesystem), here GFS 
>> comes in.
>>
>> The Cluster is only 3 nodes large at the moment, more nodes will be 
>> added later on. I am currently testing GFS1 and GFS2 for performance.
>> Lock Management is done over single 1Gbit Ethernet Links (1 per 
>> machine).
>>
>> Thing is, with GFS1 I get far better performance than with the newer
>> GFS2 across the board, with a few tunable parameters set, for writes
>> GFS1 is roughly twice as fast.
>>
>>     
> What tests are you running? GFS2 is generally faster than GFS1 except 
> for streaming writes, which is an area that we are putting some effort 
> into solving currently. Small writes (one fs block (4k default) or 
> less) on GFS2 are much faster than on GFS1.
>
>   
>> But, concurrent reads are totally abysmal. The total write 
>> performance (all nodes combined) sits around 280-330Mbyte/sec, 
>> whereas the READ performance is as low as 30-40Mbyte/sec when doing 
>> concurrent reads. Surprisingly, single-node read is somewhat ok at 
>> 180Mbyte/sec, but as soon as several nodes are reading from GFS
>> (version 1 at the moment) at the same time, things turn ugly.
>>
>>     
> Reads on GFS2 should be much faster than GFS1, so it sounds as if 
> something isn't working correctly for some reason. For cached data, 
> reads on GFS2 should be as fast as ext2/3 since the code path is 
> identical (to the page cache) and only changes if pages are not cached.
> GFS1 does its locking at a higher level, so there will be more 
> overhead for cached reads in general.
>
> Do make sure that if you are preparing the test files for reading all
> from one node (or even just on a different node from the one on which
> you are running the read tests), you sync them to disk on that node
> before starting the tests, to avoid issues with caching.
>
>   
>> This is strange, because for writes, global performance across the 
>> cluster increases slightly when adding more nodes. But for reads, the 
>> opposite seems to be true.
>>
>> For read and write tests, separate testfiles were created and read 
>> for each node, with each testfile sitting in its own subdirectory, so 
>> no node would access another node's file.
>>
>>     
> That sounds like a good test set up to me.
>
>   
>> GFS1 created with the following mkfs.gfs parameters:
>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>> (4 kB blocksize, 16 * 128 MB journals, 2 GB resource groups,
>> Distributed Lock Manager)
>>
>> Mount Options set: "noatime,nodiratime,noquota"
>>
>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
>> demote_secs 20"
>>     
> You shouldn't normally need to set the glock_purge and demote_secs to 
> anything other than the default. These settings no longer exist in 
> GFS2 since it makes use of the shrinker subsystem provided by the VM 
> and is auto-tuning. If your workload is metadata heavy, you could try 
> boosting the journal size and/or the incore_log_blocks tunable.
>
>   
>> Also, in /etc/cluster/cluster.conf, I added this:
>> <dlm plock_ownership="1" plock_rate_limit="0"/>
>> <gfs_controld plock_rate_limit="0"/>
>>
>> Any ideas on how to figure out what's going wrong, and how to tune 
>> GFS1 for better concurrent read performance, or tune GFS2 in general 
>> to be competitive/better than GFS1?
>>
>> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially and 
>> somewhat good reaction times while under heavy sequential and/or 
>> random load. But for now, I just wanna get the seq reading to work 
>> acceptably fast.
>>
>> Thanks a lot for your help!
>>
>>     
> Can you try doing some I/O direct to the block device so that we can 
> get an idea of what the raw device can manage? Using dd both read and 
> write, across the nodes (different disk locations on each node to 
> simulate different files).
>
> I'm wondering if the problem might be due to the seek pattern 
> generated by the multiple read locations,
>
> Steve.
>
--
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
michael.lackner at mu-leoben.at | +43 (0)3842/402-1505

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



