[Linux-cluster] Slowness above 500 RRDs

Sat Apr 21 11:42:45 UTC 2007

David Teigland <teigland at redhat.com> writes:

> On Wed, Mar 28, 2007 at 05:48:10PM +0200, Wagner Ferenc wrote:
>> Ferenc Wagner <wferi at niif.hu> writes:
>> 
>> > There's a good bunch of RRDs in a directory.  A script scans them for
>> > their last modification times, and then updates each in turn for a
>> > couple of times.  The number of files scanned and the length of the
>> > update rounds are printed.  The results are much different for 500 and
>> > 501 files:
>> >
>> > filecount=501
>> >   iteration=0 elapsed time=10.425568 s
>> >   iteration=1 elapsed time= 9.766178 s
>> >   iteration=2 elapsed time=20.14514 s
>> >   iteration=3 elapsed time= 2.991397 s
>> >   iteration=4 elapsed time=20.496422 s
>> > total elapsed time=63.824705 s
>> >
>> > filecount=500
>> >   iteration=0 elapsed time=6.560811 s
>> >   iteration=1 elapsed time=0.229375 s
>> >   iteration=2 elapsed time=0.202973 s
>> >   iteration=3 elapsed time=0.203439 s
>> >   iteration=4 elapsed time=0.203095 s
>> > total elapsed time=7.399693 s
>> 
>> Following up to myself with one more data point: raising
>> SHRINK_CACHE_MAX from 1000 to 20000 in gfs/dlm/lock_dlm.h helps
>> significantly, but still isn't enough.  Besides, I don't know what I'm
>> doing.  Should I tweak the surrounding #defines, too?
>
> SHRINK_CACHE_MAX is related to fcntl posix locks; did you intend to change
> the app to use flock (which is much faster than fcntl)?
>
> lock_dlm caches dlm locks for old plocks for a while in an attempt to
> improve performance and reduce thrashing the dlm -- SHRINK_CACHE_MAX is
> the max level of caching, it's fine to change it as you've done.  The fact
> that you're hitting it, though, indicates that your app is using plocks
> more heavily than gfs/dlm are suited to handle.  Switching to flock will
> obviate all of this.  (Or switching to the new dlm and cluster
> infrastructure which has a completely different and far better approach to
> plocks).
>
> Dave
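
For reference, the flock-versus-fcntl distinction Dave describes can be
sketched as below.  This is only an illustration, not the script from the
original report: the helper names (update_with_flock, update_with_plock,
time_round) and the throwaway files are made up for the example, and it
assumes the application takes its locks itself rather than inside a library.
On a local filesystem both variants will be fast; the difference discussed
above is specific to GFS.

```python
import fcntl
import os
import tempfile
import time


def update_with_flock(path, payload):
    """Append payload while holding an exclusive whole-file flock() lock."""
    with open(path, "ab") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # BSD-style lock on the open file
        try:
            f.write(payload)
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def update_with_plock(path, payload):
    """Same update, guarded by a POSIX (fcntl) lock, i.e. a plock."""
    with open(path, "ab") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)      # wraps fcntl() record locking
        try:
            f.write(payload)
            f.flush()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)


def time_round(paths, update):
    """Update every file once; return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for path in paths:
        update(path, b"x")
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical stand-ins for the RRD files: empty files in a temp dir.
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, "file%d" % i) for i in range(501)]
        for path in paths:
            open(path, "wb").close()
        for name, fn in (("flock", update_with_flock),
                         ("plock", update_with_plock)):
            print("%s: elapsed time=%f s" % (name, time_round(paths, fn)))
```

Running the two rounds against a directory on the GFS mount instead of a
temp dir would show whether plocks are where the time goes.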