[Linux-cluster] GFS (1 & partially 2) performance problems

Steven Whitehouse swhiteho at redhat.com
Mon Jun 14 14:48:13 UTC 2010


Hi,

On Mon, 2010-06-14 at 16:21 +0200, Michael Lackner wrote:
> Hello!
> 
> Thanks for your reply. I unfortunately forgot to mention HOW I was
> actually testing, stupid of me.
> 
> I tested with dd, doing 4kB blocksize reads and writes, 160GB total 
> testfile size per node.
> I read from /dev/zero for writing tests and wrote to /dev/null for 
> reading tests. So, totally
> sequential, somewhat small blocksize (equal to filesystem BS).
> 
> The performance was measured directly on the Fibrechannel Switch, which 
> offers nice
> per-port monitoring for that purpose.
> 
> I have yet to do some serious read testing on GFS2. I aborted my
> GFS2 tests as write performance was not up to GFS1 to begin with.
> My older GFS2 benchmarks (I did this with a 2-node configuration
> before) are lost, so I will need to re-do them to give you some numbers.
> 
Ok, so these are streaming writes, and plenty large enough to be
affected by the GFS2 performance issue. The reason we have that issue in
GFS2 but not GFS1 is that the lock ordering is different. We try to make
maximum use of the page cache in GFS2, which gives us the faster reads,
but also (due to the page-at-a-time write code) the slower streaming
writes. Smaller writes are faster because the overall overhead for
writing is lower on GFS2. However, that overhead is incurred per page
written on GFS2 but per write() call on GFS1, which is what makes
streaming writes slower on GFS2.

It is pretty tricky to fix because it requires being able to do
multi-page writes, which are problematic due to the (page) locking
order requirements.
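For reference, the kind of streaming-write test being discussed would
look roughly like the following; the mount point and file name are just
placeholders, and the count below gives 160GiB at a 4kB blocksize:

    # sequential 4kB write test, ~160GB per node, then flush to disk
    dd if=/dev/zero of=/mnt/gfs/node1/testfile bs=4k count=41943040
    sync

    # sequential 4kB read test of the same file, discarding the data
    dd if=/mnt/gfs/node1/testfile of=/dev/null bs=4k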

> After each write test I did a "sync" to flush everything to disks. I
> did not do this before or after read tests though.
> 
> As you mentioned journal size: "gfs_tool counters <mountpoint>" said
> that only 2-3% of the logspace was in use after the tests (I guess
> this is the per-node fs journal?).
> 
You need to measure the log space during the tests rather than at the
end, but since you are doing streaming writes, the amount of metadata is
relatively small anyway, so that's probably not an issue.
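If you want to watch the journal usage while a test is running, a
simple loop like the one below would do it (the mount point is a
placeholder, and the exact counter names vary between versions):

    # sample the GFS counters every 5 seconds while dd is running
    while sleep 5; do
        gfs_tool counters /mnt/gfs
    done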

> As for the direct I/O tests, by that you mean testing without ANY 
> caching going on, a
> synchronous write? What I did before was test EXT3 (~190MB/s) and XFS 
> (~320MB/s)
> on the Storage Array. I think what I'm getting here is raw throughput, 
> since I am not
> monitoring in the OS, but at the Fibrechannel Switch itself.
> 
I was thinking of just testing the block device without any fs on it.
That would give you an absolute max figure. However, bearing in mind the
similarities between the GFS2 on-disk layout and ext3, I would expect
the performance to be closer (on a single-node basis) to that than to
XFS. There is always going to be some overhead relating to using a
cluster filesystem, so single-node tests will be slower. Having said
that, there shouldn't be a huge gap, and the scaling wrt the number of
nodes that you are looking for should be achievable.
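A raw-device run could be as simple as the following; the device name
and offsets are placeholders, and each node starts at a different point
on the LUN so the I/O patterns don't overlap (skip/seek are in units
of bs):

    # node 1: raw sequential read from the start of the LUN
    dd if=/dev/sdb of=/dev/null bs=4k count=41943040

    # node 2: the same, but starting about 200GB further in
    dd if=/dev/sdb of=/dev/null bs=4k count=41943040 skip=52428800

    # write variant (overwrites whatever is on the device!)
    # dd if=/dev/zero of=/dev/sdb bs=4k count=41943040 seek=52428800

Adding iflag=direct/oflag=direct would take the page cache out of the
picture as well, if your dd supports it.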

> I will do GFS2 read tests similar to those conducted for GFS1. I'll be
> able to do that tomorrow morning, then I can post the numbers here.
> 
Ok. That would be interesting. Thanks,

Steve.

> Thanks!
> 
> Steven Whitehouse wrote:
> > Hi,
> >
> > On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
> >   
> >> Hello!
> >>
> >> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
> >>
> >> At the moment, the storage subsystem consists of an HP MSA2312
> >> Fibrechannel SAN linked to an FC 8gbit switch. Three client machines
> >> are connected to that switch over 8gbit FC. The disks themselves are
> >> 12 * 15,000rpm SAS configured in RAID-5 with two hotspares.
> >>
> >> Now, the whole storage shall be shared (single filesystem), here GFS
> >> comes in.
> >>
> >> The Cluster is only 3 nodes large at the moment, more nodes will be
> >> added later on. I am currently testing GFS1 and GFS2 for performance.
> >> Lock Management is done over single 1Gbit Ethernet Links (1 per
> >> machine).
> >>
> >> Thing is, with GFS1 I get far better performance than with the newer
> >> GFS2 across the board, with a few tunable parameters set; for writes
> >> GFS1 is roughly twice as fast.
> >>
> >>     
> > What tests are you running? GFS2 is generally faster than GFS1 except
> > for streaming writes, which is an area that we are putting some effort
> > into solving currently. Small writes (one fs block (4k default) or less)
> > on GFS2 are much faster than on GFS1.
> >
> >   
> >> But, concurrent reads are totally abysmal. The total write performance
> >> (all nodes combined) sits around 280-330Mbyte/sec, whereas the
> >> READ performance is as low as 30-40Mbyte/sec when doing concurrent
> >> reads. Surprisingly, single-node read is somewhat ok at 180Mbyte/sec,
> >> but as soon as several nodes are reading from GFS (version 1 at the
> >> moment) at the same time, things turn ugly.
> >>
> >>     
> > Reads on GFS2 should be much faster than GFS1, so it sounds as if
> > something isn't working correctly for some reason. For cached data,
> > reads on GFS2 should be as fast as ext2/3 since the code path is
> > identical (to the page cache) and only changes if pages are not cached.
> > GFS1 does its locking at a higher level, so there will be more overhead
> > for cached reads in general.
> >
> > If you are preparing the test files for reading all from one node (or
> > even just a different node to the one on which you are running the
> > read tests), do make sure you sync them to disk on that node before
> > starting the tests, to avoid issues with caching.
> >
> >   
> >> This is strange, because for writes, global performance across the
> >> cluster increases slightly when adding more nodes. But for reads, the
> >> opposite seems to be true.
> >>
> >> For read and write tests, separate testfiles were created and read for
> >> each node, with each testfile sitting in its own subdirectory, so no
> >> node would access another node's file.
> >>
> >>     
> > That sounds like a good test set up to me.
> >
> >   
> >> GFS1 created with the following mkfs.gfs parameters:
> >> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
> >> (4kB blocksize, 16 * 128MB journals, 2GB resource groups,
> >> Distributed LockManager)
> >>
> >> Mount Options set: "noatime,nodiratime,noquota"
> >>
> >> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
> >> demote_secs 20"
> >>     
> > You shouldn't normally need to set the glock_purge and demote_secs to
> > anything other than the default. These settings no longer exist in GFS2
> > since it makes use of the shrinker subsystem provided by the VM and is
> > auto-tuning. If your workload is metadata heavy, you could try boosting
> > the journal size and/or the incore_log_blocks tunable.
> >
> >   
> >> Also, in /etc/cluster/cluster.conf, I added this:
> >> <dlm plock_ownership="1" plock_rate_limit="0"/>
> >> <gfs_controld plock_rate_limit="0"/>
> >>
> >> Any ideas on how to figure out what's going wrong, and how to
> >> tune GFS1 for better concurrent read performance, or tune GFS2
> >> in general to be competitive/better than GFS1?
> >>
> >> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially
> >> and somewhat good reaction times while under heavy sequential
> >> and/or random load. But for now, I just wanna get the seq reading
> >> to work acceptably fast.
> >>
> >> Thanks a lot for your help!
> >>
> >>     
> > Can you try doing some I/O direct to the block device so that we can get
> > an idea of what the raw device can manage? Using dd both read and write,
> > across the nodes (different disk locations on each node to simulate
> > different files).
> >
> > I'm wondering if the problem might be due to the seek pattern generated
> > by the multiple read locations,
> >
> > Steve.
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >   



