[Linux-cluster] gfs tuning
Wendy Cheng
s.wendy.cheng at gmail.com
Thu Jun 19 15:42:17 UTC 2008
Wendy Cheng wrote:
> Terry wrote:
>> On Tue, Jun 17, 2008 at 5:22 PM, Terry <td3201 at gmail.com> wrote:
>>
>>> On Tue, Jun 17, 2008 at 3:09 PM, Wendy Cheng
>>> <s.wendy.cheng at gmail.com> wrote:
>>>
>>>> Hi, Terry,
>>>>
>>>>> I am still seeing some high load averages. Here is an example of a
>>>>> gfs configuration. I left statfs_fast off as it would not apply to
>>>>> one of my volumes for an unknown reason. Not sure that would have
>>>>> helped anyways. I do, however, feel that reducing scand_secs helped a
>>>>> little:
>>>>>
>>>>>
>>>> Sorry, I missed scand_secs (my mind was mostly occupied by daytime
>>>> work).
>>>>
>>>> To simplify the view, glock states include exclusive (write), shared
>>>> (read), and not-locked (in reality, there are more). An exclusive lock
>>>> has to be demoted to shared (after demote_secs), then to not-locked
>>>> (after another demote_secs), before it is scanned (every scand_secs)
>>>> and added to the reclaim list, where it can be purged. On the
>>>> transition from exclusive to shared, the file contents need to be
>>>> flushed to disk (to keep file content cluster coherent). All of the
>>>> above assumes the file protected by this glock is idle (not accessed).
>>>>
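
The timing in the paragraph above can be put into rough numbers. This is
only a minimal sketch of the simplified three-state model described there,
not of the real GFS state machine. The function name and the values 300
and 5 are illustrative assumptions, not settings read from any system.

```python
# Simplified model only: exclusive -> shared -> not-locked, each demotion
# needing one demote_secs of idleness, followed by at most one scand_secs
# wait before the scanner puts the glock on the reclaim list.
DEMOTIONS_NEEDED = {"exclusive": 2, "shared": 1, "not-locked": 0}


def idle_secs_until_reclaimable(demote_secs, scand_secs, state="exclusive"):
    """Rough upper estimate of idle time before a glock can be reclaimed."""
    if state not in DEMOTIONS_NEEDED:
        raise ValueError("unknown glock state: %r" % (state,))
    return DEMOTIONS_NEEDED[state] * demote_secs + scand_secs


if __name__ == "__main__":
    # Illustrative values (assumptions), not taken from the poster's setup.
    for state in ("exclusive", "shared", "not-locked"):
        print(state, idle_secs_until_reclaimable(300, 5, state))
```

Under this model, lowering scand_secs only trims the last term, while
demote_secs is counted up to twice for a file that was last written.
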
>>>> You hit an area where GFS normally doesn't perform well. With GFS1 in
>>>> maintenance mode while GFS2 seems to be so far away, ext3 could be a
>>>> better answer. However, before switching, do make sure to test it
>>>> thoroughly (since ext3 could have the very same issue as well - check
>>>> out: http://marc.info/?l=linux-nfs&m=121362947909974&w=2 ).
>>>>
>>>> Did you look at (and test) the GFS "nolock" protocol (for single-node
>>>> GFS)? It bypasses some locking overhead and can be switched to DLM in
>>>> the future (just make sure you reserve enough journal space - the rule
>>>> of thumb is one journal per node, so know how many nodes you plan to
>>>> have in the future).
>>>>
>>>> -- Wendy
>>>>
>>> Good points. I could try the nolock feature I suppose. Not quite
>>> clear on how to reserve journal space. I forgot to post the cpu time,
>>> check out this:
>>>
>>>  PID USER PR NI VIRT RES SHR S %CPU %MEM     TIME+ COMMAND
>>> 4822 root 10 -5    0   0   0 S    1  0.0   2159:15 dlm_recv
>>> 4820 root 10 -5    0   0   0 S    1  0.0 368:09.34 dlm_astd
>>> 4821 root 10 -5    0   0   0 S    0  0.0 153:06.80 dlm_scand
>>> 3659 root 10 -5    0   0   0 S    0  0.0 134:40.14 scsi_wq_4
>>> 4823 root 11 -5    0   0   0 S    1  0.0 109:33.33 dlm_send
>>>  367 root 10 -5    0   0   0 S    0  0.0 103:33.74 kswapd0
>>>
>>> gfs_glockd is further below, so I'm not so concerned with that right
>>> now. It appears turning on nolock would do the trick. The times aren't
>>> extremely accurate because I have failed this cluster over between
>>> nodes while testing.
>>>
>>>
>>
>> Here is some more testing information....
>>
>> I created a new 1 TB volume on my iscsi san and formatted it for
>> ext3. I then used dd to create a 100G file. This yielded roughly 900
>> Mb/sec. I then stopped my application and did the same thing with an
>> existing GFS volume. This gave me about 850 Kb/sec. This isn't an
>> iscsi issue. This appears to be a load issue related to the number of
>> I/Os occurring on these volumes. That said, I would expect that the
>> changes I made would result in a major performance improvement.
>> Since they didn't, what other options could I consider? If it's a
>> GFS issue, ext3 is the way to go. Maybe even switch to using
>> active-active on my NFS cluster. If it's a backend disk issue, I
>> would expect to see the throughput on my iscsi link (bond1) be fully
>> utilized. It's not. Could I be thrashing the disks? This is an iscsi
>> san with 30 sata disks. Just bouncing some thoughts around to see if
>> anyone has any more thoughts.
>>
>>
> I really need to focus on my daytime job - its workload has been
> climbing ... but I can't help adding a quick comment here ..
>
> The 900 MB/s vs. 850 KB/s difference looks like a caching issue -
> that is, in the 900 MB/s case, it looks like the data was still
> lingering in the system cache, while in the 850 KB/s case, the data
> might already have hit disk. A cluster filesystem normally syncs more
> by its nature. In general, ext3 does perform better in a single-node
> environment, but the difference should not be as big as above.
>
> There are certainly more tuning knobs available (such as journal size
> and/or network buffer size) to make GFS-iscsi "dd" run better, but it
> is pointless. To deploy a cluster filesystem for production usage, the
> tuning should not be driven by such a simple-minded command. You also
> have to consider the support issues when deploying a filesystem. GFS1
> is a little bit out of date, and any new development and/or significant
> performance improvements would likely be in GFS2, not in GFS1.
> Research GFS2 (google to see what other people have said about it) to
> understand whether its direction fits your needs (so you can migrate
> from GFS1 to GFS2 if you bump into any show stopper in the future). If
> not, ext3 (with ext4 actively developed) is a fine choice, if I read
> your configuration right from previous posts.
>
Or .. there is a known GFS1 writepage issue if most of your files are
very big .. The problem is fixed in RHEL kernels, though. What is
your kernel version?
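
One way to check the caching explanation quoted above is to time the same
write twice, once stopping the clock as soon as the data is handed to the
kernel and once only after it has been forced to disk. Below is a minimal
sketch of that idea. The function names, the 64 MiB size and the mount
points mentioned afterwards are all hypothetical, and it is no substitute
for testing with the real application load.

```python
import os
import tempfile
import time


def write_rate_mib_s(path, total_bytes, block_size=1024 * 1024, sync=False):
    """Write total_bytes of zeros to path and return the rate in MiB/s.

    With sync=False the timer may stop while data is still in the page
    cache. With sync=True the timer stops only after fsync() returns.
    """
    block = b"\0" * block_size
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            chunk = block[: total_bytes - written]  # last chunk may be short
            f.write(chunk)
            written += len(chunk)
        if sync:
            f.flush()
            os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return written / (1024 * 1024) / max(elapsed, 1e-9)


def compare(directory, total_bytes=64 * 1024 * 1024):
    """Return (cached_rate, synced_rate) using scratch files in directory."""
    rates = []
    for sync in (False, True):
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            rates.append(write_rate_mib_s(path, total_bytes, sync=sync))
        finally:
            os.remove(path)  # do not leave scratch files on the volume
    return tuple(rates)
```

Running compare() against a directory on the ext3 volume and one on the
GFS volume (for example the hypothetical /mnt/ext3test and /mnt/gfstest)
gives four numbers. If the two synced rates are close while the cached
rates differ widely, the gap is more likely caching than the filesystem
itself. With GNU dd, conv=fsync serves a similar purpose.
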
-- Wendy