[Linux-cluster] gfs tuning

Thu Jun 19 14:49:17 UTC 2008

Terry wrote:
> On Tue, Jun 17, 2008 at 5:22 PM, Terry <td3201 at gmail.com> wrote:
>   
>> On Tue, Jun 17, 2008 at 3:09 PM, Wendy Cheng <s.wendy.cheng at gmail.com> wrote:
>>     
>>> Hi, Terry,
>>>       
>>>> I am still seeing some high load averages.  Here is an example of a
>>>> gfs configuration.  I left statfs_fast off as it would not apply to
>>>> one of my volumes for an unknown reason.  Not sure that would have
>>>> helped anyways.  I do, however, feel that reducing scand_secs helped a
>>>> little:
>>>>
>>>>         
>>> Sorry I missed scand_secs (was mindless as the brain was mostly occupied by
>>> day time work).
>>>
>>> To simplify the view, glock states include exclusive (write), shared (read),
>>> and not-locked (in reality, there are more). An exclusive lock has to be
>>> demoted (demote_secs) to shared, then to not-locked (another demote_secs)
>>> before it is scanned (every scand_secs) and added to the reclaim list, where
>>> it can be purged. During the exclusive-to-shared transition, the file
>>> contents need to be flushed to disk (to keep the file content cluster
>>> coherent).  All of the above assumes the file (protected by this glock) is
>>> not accessed (idle).
>>>
>>> You hit an area where GFS normally doesn't perform well. With GFS1 in
>>> maintenance mode while GFS2 seems to be so far away, ext3 could be a better
>>> answer. However, before switching, do make sure to test it thoroughly (since
>>> ext3 could have the very same issue - check out:
>>> http://marc.info/?l=linux-nfs&m=121362947909974&w=2 ).
>>>
>>> Did you look at (and test) the GFS "nolock" protocol (for single-node GFS)?
>>> It bypasses some locking overhead and can be switched to DLM in the future
>>> (just make sure you reserve enough journal space - the rule of thumb is one
>>> journal per node, so know how many nodes you plan to have in the future).
>>>
>>> -- Wendy
>>>       
>> Good points.  I could try the nolock feature I suppose.  Not quite
>> clear on how to reserve journal space.  I forgot to post the cpu time,
>> check out this:
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
>>  4822 root      10  -5     0    0    0 S    1  0.0   2159:15 dlm_recv
>>  4820 root      10  -5     0    0    0 S    1  0.0 368:09.34 dlm_astd
>>  4821 root      10  -5     0    0    0 S    0  0.0 153:06.80 dlm_scand
>>  3659 root      10  -5     0    0    0 S    0  0.0 134:40.14 scsi_wq_4
>>  4823 root      11  -5     0    0    0 S    1  0.0 109:33.33 dlm_send
>>   367 root      10  -5     0    0    0 S    0  0.0 103:33.74 kswapd0
>>
>> gfs_glockd is further below so not so concerned with that right now.
>> It appears turning on nolock would do the trick.  The times aren't
>> extremely accurate because I have failed this cluster over between nodes
>> while testing.
>>
>>     
>
> Here is some more testing information....
>
> I created a new volume on my iscsi san of 1 TB and formatted it for
> ext3. I then used dd to create a 100G file.  This yielded roughly 900
> Mb/sec.  I then stopped my application and did the same thing with an
> existing GFS volume.  This gave me about 850 Kb/sec.  This isn't an
> iscsi issue.  This appears to be a load issue related to the number of I/O
> operations occurring on these volumes.  That said, I would expect that
> the changes I made would result in a major performance improvement.
> Since they didn't, what other options could I consider?  If it's a
> GFS issue, ext3 is the way to go.  Maybe even switch to using
> active-active on my NFS cluster.  If it's a backend disk issue, I
> would expect to see the throughput on my iscsi link (bond1) be fully
> utilized.  It's not.  Could I be thrashing the disks?  This is an iscsi
> san with 30 sata disks.  Just bouncing some thoughts around to see if
> anyone has any more thoughts.
>
>   
I really need to focus on my daytime job (its workload has been climbing),
but I can't help adding a quick comment here.

The 900 MB/s vs. 850 KB/s difference looks like a caching issue. That
is, in the 900 MB/s case, it looks like the data was still lingering in
the system cache, while in the 850 KB/s case, the data might already
have hit the disk. A cluster filesystem normally syncs more by its
nature. In general, ext3 does perform better in a single-node
environment, but the difference should not be as big as the one above.
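
As a rough illustration of the caching effect described above, the
following Python sketch times the same sequential write twice: once
stopping the clock as soon as the writes return (data may still be in
the page cache), and once after os.fsync() has pushed the data to the
device. The function name timed_write, the 64 MiB size, and the file
names are made up for this example and are not from the tests in this
thread.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, block_size=1 << 20, sync=False):
    """Write total_bytes of zeros to path and return bytes per second.

    With sync=False the timer stops once the writes have been handed to
    the kernel, so the data may still sit in the page cache. With
    sync=True the timer also covers os.fsync(), which asks the kernel to
    flush the file's data to the storage device.
    """
    block = b"\0" * block_size
    remaining = total_bytes
    start = time.perf_counter()
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = block[:min(block_size, remaining)]
            f.write(chunk)
            remaining -= len(chunk)
        if sync:
            f.flush()              # drain Python's own buffer first
            os.fsync(f.fileno())   # then ask the kernel to hit the disk
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_bytes / elapsed


if __name__ == "__main__":
    size = 64 * 1024 * 1024  # illustrative size only
    # Assumption: replace the temporary directory with a directory on the
    # filesystem under test; a tmpfs-backed /tmp would hide the effect.
    with tempfile.TemporaryDirectory() as d:
        cached = timed_write(os.path.join(d, "cached.bin"), size)
        synced = timed_write(os.path.join(d, "synced.bin"), size, sync=True)
    print("without fsync: %.1f MB/s" % (cached / 1e6))
    print("with fsync:    %.1f MB/s" % (synced / 1e6))
```

A large gap between the two numbers on the same volume suggests the
faster figure mostly measured memory, not disk. With GNU dd, adding
conv=fsync to the command has a similar effect on the reported rate.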

There are certainly more tuning knobs available (such as journal size
and/or network buffer size) to make the GFS-iscsi "dd" run better, but
it is pointless. To deploy a cluster filesystem for production usage,
the tuning should not be driven by such a simple-minded command. You
also have to consider the support issues when deploying a filesystem.
GFS1 is a little bit out of date, and any new development and/or
significant performance improvements would likely be in GFS2, not in
GFS1. Research GFS2 (search to see what other people say about it) to
understand whether its direction fits your needs (so you can migrate
from GFS1 to GFS2 if you bump into any show stopper in the future). If
not, ext3 (with ext4 actively developed) is a fine choice, if I read
your configuration right from previous posts.
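
For reference, the simplified glock aging described in the quoted
message above (exclusive to shared, shared to not-locked, then picked
up by the scan) can be put into numbers. This is only a sketch of that
simplified three-state model, not of the real GFS code; the function
name and the sample values passed in are illustrative assumptions, not
settings taken from this cluster.

```python
# Number of demotion steps left before a glock is not-locked, per the
# simplified three-state model (real GFS has more states).
DEMOTIONS_LEFT = {"exclusive": 2, "shared": 1, "not-locked": 0}


def idle_seconds_until_reclaimable(state, demote_secs, scand_secs):
    """Upper bound on idle time before an untouched glock is queued for reclaim.

    Each remaining demotion waits demote_secs, and the scan that moves
    not-locked glocks to the reclaim list runs every scand_secs, so it
    can add up to one more scand_secs of waiting.
    """
    return DEMOTIONS_LEFT[state] * demote_secs + scand_secs


if __name__ == "__main__":
    # Illustrative values only.
    print(idle_seconds_until_reclaimable("exclusive", 300, 5))   # 605
    print(idle_seconds_until_reclaimable("shared", 300, 5))      # 305
    print(idle_seconds_until_reclaimable("exclusive", 100, 2))   # 202
```

In this model demote_secs dominates the total, so lowering scand_secs
alone shortens the wait only slightly.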

-- Wendy