[Linux-cluster] gfs tuning

Thu Jun 19 16:03:42 UTC 2008

On Thu, Jun 19, 2008 at 10:42 AM, Wendy Cheng <s.wendy.cheng at gmail.com> wrote:
> Wendy Cheng wrote:
>>
>> Terry wrote:
>>>
>>> On Tue, Jun 17, 2008 at 5:22 PM, Terry <td3201 at gmail.com> wrote:
>>>
>>>>
>>>> On Tue, Jun 17, 2008 at 3:09 PM, Wendy Cheng <s.wendy.cheng at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Hi, Terry,
>>>>>
>>>>>>
>>>>>> I am still seeing some high load averages.  Here is an example of a
>>>>>> gfs configuration.  I left statfs_fast off as it would not apply to
>>>>>> one of my volumes for an unknown reason.  Not sure that would have
>>>>>> helped anyways.  I do, however, feel that reducing scand_secs helped a
>>>>>> little:
>>>>>>
>>>>>>
>>>>>
>>>>> Sorry I missed scand_secs (was mindless as the brain was mostly
>>>>> occupied by day time work).
>>>>>
>>>>> To simplify the view, glock states include exclusive (write), shared
>>>>> (read), and not-locked (in reality, there are more). An exclusive lock
>>>>> has to be demoted (demote_secs) to shared, then to not-locked (another
>>>>> demote_secs), before it is scanned (every scand_secs) and added to the
>>>>> reclaim list, where it can be purged. On the transition from exclusive
>>>>> to shared, the file contents need to be flushed to disk (to keep file
>>>>> content cluster coherent). All of the above assumes the file (protected
>>>>> by this glock) is not accessed (idle).
>>>>>
>>>>> You hit an area where GFS normally doesn't perform well. With GFS1 in
>>>>> maintenance mode while GFS2 seems to be so far away, ext3 could be a
>>>>> better answer. However, before switching, do make sure to test it
>>>>> thoroughly (since ext3 could have the very same issue as well - check
>>>>> out: http://marc.info/?l=linux-nfs&m=121362947909974&w=2 ).
>>>>>
>>>>> Did you look at (and test) the GFS "nolock" protocol (for single-node
>>>>> GFS)? It bypasses some locking overhead and can be switched to DLM in
>>>>> the future (just make sure you reserve enough journal space - the rule
>>>>> of thumb is one journal per node, so know how many nodes you plan to
>>>>> have in the future).
>>>>>
>>>>> -- Wendy
>>>>>
>>>>
>>>> Good points.  I could try the nolock feature I suppose.  Not quite
>>>> clear on how to reserve journal space.  I forgot to post the cpu time,
>>>> check out this:
>>>>
>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
>>>>  4822 root      10  -5     0    0    0 S    1  0.0   2159:15 dlm_recv
>>>>  4820 root      10  -5     0    0    0 S    1  0.0 368:09.34 dlm_astd
>>>>  4821 root      10  -5     0    0    0 S    0  0.0 153:06.80 dlm_scand
>>>>  3659 root      10  -5     0    0    0 S    0  0.0 134:40.14 scsi_wq_4
>>>>  4823 root      11  -5     0    0    0 S    1  0.0 109:33.33 dlm_send
>>>>   367 root      10  -5     0    0    0 S    0  0.0 103:33.74 kswapd0
>>>>
>>>> gfs_glockd is further below so not so concerned with that right now.
>>>> It appears turning on nolock would do the trick.  The times aren't
>>>> extremely accurate because I have failed this cluster over between
>>>> nodes while testing.
>>>>
>>>>
>>>
>>> Here is some more testing information....
>>>
>>> I created a new volume on my iscsi san of 1 TB and formatted it for
>>> ext3. I then used dd to create a 100G file.  This yielded roughly 900
>>> Mb/sec.  I then stopped my application and did the same thing with an
>>> existing GFS volume.  This gave me about 850 Kb/sec.  This isn't an
>>> iscsi issue.  This appears to be a load issue related to the number of
>>> I/Os occurring on these volumes.  That said, I would expect that the
>>> changes I made would result in a major performance improvement.
>>> Since they didn't, what other options could I consider?  If it's a
>>> GFS issue, ext3 is the way to go.  Maybe even switch to using
>>> active-active on my NFS cluster.  If it's a backend disk issue, I
>>> would expect to see the throughput on my iscsi link (bond1) be fully
>>> utilized.  It's not.  Could I be thrashing the disks?  This is an iscsi
>>> san with 30 sata disks.  Just bouncing some thoughts around to see if
>>> anyone has any more thoughts.
>>>
>>>
>>
>> Really need to focus on my day time job - its workload has been climbing
>> ... but can't help but place a quick comment here ..
>>
>> The 900 MB/s vs. 850 KB/s difference looks like a caching issue - that
>> is, for 900 MB/s, it looks like the data was still lingering in the system
>> cache while in the 850 KB/s case, the data might already have hit disk. A
>> cluster filesystem normally syncs more by its nature. In general, ext3 does
>> perform better in a single-node environment but the difference should not
>> be as big as above.
>> There are certainly more tuning knobs available (such as journal size
>> and/or network buffer size) to make GFS-iscsi "dd" run better but it is
>> pointless. To deploy a cluster filesystem for production usage, the tuning
>> should not be driven by such a simple-minded command. You also have to
>> consider the support issues when deploying a filesystem. GFS1 is a little
>> bit out of date and any new development and/or significant performance
>> improvements would likely be in GFS2, not in GFS1. Research GFS2 (googling
>> to see what other people say about it) to understand whether its direction
>> fits your needs (so you can migrate from GFS1 to GFS2 if you bump into any
>> show stopper in the future). If not, ext3 (with ext4 actively developed) is
>> a fine choice if I read your configuration right from previous posts.
>>
> Or .. there is a known GFS1 writepage issue if most of your files are
> very big .. The problem is fixed in RHEL kernels though. What is your kernel
> version ?
>
> -- Wendy

2.6.18-92.el5

The files are not all very big though.  Varies.
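
For reference, the simplified glock lifecycle described earlier in the thread (exclusive -> shared after demote_secs, shared -> not-locked after another demote_secs, then picked up by a scan that runs every scand_secs) suggests a rough upper bound on how long an idle glock can linger before it lands on the reclaim list. The sketch below is only arithmetic on that simplified description; the function name and the sample values are made up for illustration and are not taken from any real configuration in this thread.

```python
def worst_case_idle_seconds(demote_secs: int, scand_secs: int) -> int:
    """Rough upper bound, in seconds, before an idle exclusive glock
    reaches the reclaim list under the simplified three-state model.

    Two demotions (exclusive -> shared -> not-locked), each waiting up to
    demote_secs, plus up to one full scan interval (scand_secs).
    """
    if demote_secs < 0 or scand_secs < 0:
        raise ValueError("tunables must be non-negative")
    return 2 * demote_secs + scand_secs


if __name__ == "__main__":
    # Hypothetical tunable pairs, chosen only to show the effect of each knob.
    for demote, scand in [(300, 5), (200, 30), (100, 5)]:
        total = worst_case_idle_seconds(demote, scand)
        print(f"demote_secs={demote:4d} scand_secs={scand:3d} "
              f"-> up to ~{total} s idle before reclaim")
```

Under this model demote_secs counts twice, so lowering it shortens the window more than lowering scand_secs by the same amount. Real GFS has more lock states than this, so treat the number as an estimate only.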