[Linux-cluster] Slowness above 500 RRDs

Wed Jun 13 15:01:21 UTC 2007

On Wed, Jun 13, 2007 at 04:38:40PM +0200, Ferenc Wagner wrote:
> David Teigland <teigland at redhat.com> writes:
> 
> >>>> But looks like nodeA feels obliged to communicate its locking
> >>>> process around the cluster.
> >>>
> >>> I'm not sure what you mean here.  To see the amount of dlm locking traffic
> >>> on the network, look at port 21064.  There should be very little in the
> >>> test above... and the dlm locking that you do see should mostly be related
> >>> to file i/o, not flocks.
> >> 
> >> There was much traffic on port 21064.  Possibly related to file I/O
> >> and not flocks, I can't tell.  But that agrees with my speculation
> >> that it's not the explicit [pf]locks that take much time, but
> >> something else.
> >
> > Could you comment the fcntl/flock calls out of the application entirely
> > and try it?
> 
> Let's see.  A typical test run looks like this (first with fcntl
> locking; tcpdump slows down the first iteration from about 6 s):
> 
> filecount=500
>   iteration=0 elapsed time=20.196318 s
>   iteration=1 elapsed time=0.323969 s
>   iteration=2 elapsed time=0.319929 s
>   iteration=3 elapsed time=0.361738 s
>   iteration=4 elapsed time=0.399365 s
> total elapsed time=21.601319 s
> 
> During the first (slow) iteration, there's much traffic on port 21064.
> During the next (fast) iterations there's no traffic at all on that port.
> If I rerun the test immediately, there's still no traffic.
> 5 minutes later, without any action on my part, there's a couple of
> packets again, then 20 s later a bigger bunch (around 30).
> After this, the first iteration generates much traffic again, GOTO 10.
> 
> If I use flock instead, the beginning is similar, but after about 10 s
> from the finish of the test, some small traffic appears by itself, and
> if I rerun the test after this, it generates traffic again, although
> much less than after 5 minutes.  The traffic generated 5 minutes after
> the test run consists of a couple of packets followed by a much bigger
> bunch 5 s later.
> 
> If I don't use any locking at all, then the situation is the same as
> with fcntl locking, but the "automatic" traffic consists of a small
> burst (a couple of packets) 4 min 51 s after the finish, then about 30
> packets 25 s later.
> 
> Does it tell you anything?  The timings are perhaps somewhat off
> because of the 20 s runtime.  If you can make some sense out of this,

It sounds pretty normal, but I'd need to repeat the test myself to figure
out exactly what's happening.  The 10 sec is probably toss_secs from the
dlm; you can increase it with:

echo 20 >> /sys/kernel/config/dlm/cluster/toss_secs
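
If you want to see the old value before changing it, something like the
following could wrap that echo.  This is only a rough sketch: the function
name set_toss_secs and the TOSS_FILE override are made up for illustration,
and the only thing taken from the above is the configfs path.  It assumes
the file can be read back after writing.

```shell
#!/bin/sh
# Sketch only: set_toss_secs and TOSS_FILE are illustrative names, not part
# of any dlm tooling.  TOSS_FILE defaults to the configfs path given above
# and can be pointed at a scratch file for a dry run.
TOSS_FILE="${TOSS_FILE:-/sys/kernel/config/dlm/cluster/toss_secs}"

set_toss_secs() {
    new="$1"
    # Accept only a non-negative integer number of seconds.
    case "$new" in
        ''|*[!0-9]*)
            echo "toss_secs must be a non-negative integer" >&2
            return 1
            ;;
    esac
    if [ ! -w "$TOSS_FILE" ]; then
        echo "cannot write $TOSS_FILE (not root, or dlm not configured?)" >&2
        return 1
    fi
    echo "old toss_secs: $(cat "$TOSS_FILE")"
    echo "$new" > "$TOSS_FILE"
    echo "new toss_secs: $(cat "$TOSS_FILE")"
}
```

Run as root on a cluster node, "set_toss_secs 20" would then do the same as
the echo above, while rejecting a mistyped value before it reaches configfs.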

> I'd be glad to hear it.  Also, I'd like to tweak the 5 minutes
> timeout, where does it come from?  Is it settable by sysfs or
> gfs_tool?

gfs_tool gettune <mountpoint> | grep demote_secs

should show 300 (seconds, i.e. your 5 minutes).  To increase it:

gfs_tool settune <mountpoint> demote_secs <value>
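
To script the two commands together, a sketch along these lines might do.
The helper names tune_value and raise_demote_secs are hypothetical, and the
parsing assumes that gettune prints one "name = value" pair per line; check
that against the output on your system before relying on it.

```shell
#!/bin/sh
# Sketch only: tune_value and raise_demote_secs are illustrative helpers.
# Assumption: "gfs_tool gettune" prints lines of the form "name = value".

# tune_value NAME: read gettune-style output on stdin, print NAME's value.
# Exits non-zero if NAME is not found.
tune_value() {
    awk -v name="$1" '
        $1 == name && $2 == "=" { print $3; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# raise_demote_secs MOUNTPOINT SECONDS: raise demote_secs to SECONDS,
# leaving it alone if it is already at least that large.
raise_demote_secs() {
    mnt="$1"
    want="$2"
    cur=$(gfs_tool gettune "$mnt" | tune_value demote_secs) || return 1
    if [ "$cur" -lt "$want" ]; then
        gfs_tool settune "$mnt" demote_secs "$want"
    fi
}
```

For example, "raise_demote_secs <mountpoint> 600" would move the timeout
from 5 to 10 minutes, and do nothing if it had already been raised.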

Dave