[Linux-cluster] Slowness above 500 RRDs

Wed Jun 13 14:38:40 UTC 2007

David Teigland <teigland at redhat.com> writes:

>>>> But looks like nodeA feels obliged to communicate its locking
>>>> process around the cluster.
>>>
>>> I'm not sure what you mean here.  To see the amount of dlm locking traffic
>>> on the network, look at port 21064.  There should be very little in the
>>> test above... and the dlm locking that you do see should mostly be related
>>> to file i/o, not flocks.
>> 
>> There was much traffic on port 21064.  Possibly related to file I/O
>> and not flocks, I can't tell.  But that agrees with my speculation,
>> that it's not the explicit [pf]locks that take much time, but
>> something else.
>
> Could you comment the fcntl/flock calls out of the application entirely
> and try it?

Let's see.  A typical test run looks like this (first with fcntl
locking; running tcpdump slows the first iteration down from about
6 s to about 20 s):

filecount=500
  iteration=0 elapsed time=20.196318 s
  iteration=1 elapsed time=0.323969 s
  iteration=2 elapsed time=0.319929 s
  iteration=3 elapsed time=0.361738 s
  iteration=4 elapsed time=0.399365 s
total elapsed time=21.601319 s

During the first (slow) iteration, there's a lot of traffic on port 21064.
During the next (fast) iterations there's no traffic at all on that port.
If I rerun the test immediately, there's still no traffic.
5 minutes later, without any action on my part, a couple of packets
appear again, then 20 s later a bigger bunch (around 30).
After this, the first iteration generates a lot of traffic again, GOTO 10.

If I use flock instead, the beginning is similar, but about 10 s after
the test finishes, a little traffic appears by itself, and if I rerun
the test after this, it generates traffic again, although much less
than after 5 minutes.  The traffic that appears 5 minutes after the
test run consists of a couple of packets followed by a much bigger
bunch 5 s later.

If I don't use any locking at all, then the situation is the same as
with fcntl locking, but the "automatic" traffic consists of a small
burst (a couple of packets) 4 min 51 s after the finish, then about 30
packets 25 s later.
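
For anyone who wants to reproduce runs like these, here is a minimal
sketch of such a test in Python.  It is not the program used for the
numbers above.  The file names, the one-byte append per file and the
names touch_files/run_test are assumptions made for illustration; only
the three locking variants (fcntl, flock, none) and the output format
follow this message.

```python
import fcntl
import os
import tempfile
import time

MODES = ("fcntl", "flock", "none")


def touch_files(directory, filecount, mode):
    """Open each file, optionally lock it, append one byte, unlock."""
    for i in range(filecount):
        # Hypothetical naming scheme; the real test files are not shown.
        path = os.path.join(directory, "file%04d.dat" % i)
        with open(path, "a+b") as f:
            if mode == "fcntl":
                fcntl.lockf(f, fcntl.LOCK_EX)   # POSIX lock via fcntl(2)
            elif mode == "flock":
                fcntl.flock(f, fcntl.LOCK_EX)   # whole-file lock via flock(2)
            f.write(b"x")
            f.flush()
            if mode == "fcntl":
                fcntl.lockf(f, fcntl.LOCK_UN)
            elif mode == "flock":
                fcntl.flock(f, fcntl.LOCK_UN)


def run_test(directory, filecount=500, iterations=5, mode="fcntl"):
    """Time several passes over the files; return per-iteration seconds."""
    if mode not in MODES:
        raise ValueError("mode must be one of %r" % (MODES,))
    print("filecount=%d" % filecount)
    times = []
    start = time.perf_counter()
    for iteration in range(iterations):
        t0 = time.perf_counter()
        touch_files(directory, filecount, mode)
        elapsed = time.perf_counter() - t0
        times.append(elapsed)
        print("  iteration=%d elapsed time=%f s" % (iteration, elapsed))
    print("total elapsed time=%f s" % (time.perf_counter() - start))
    return times


if __name__ == "__main__":
    # A local temporary directory only exercises the code path; point
    # `directory` at a directory on the GFS mount to see the cluster effect.
    with tempfile.TemporaryDirectory() as directory:
        run_test(directory, mode="fcntl")
```

Running it once per mode against the same directory on the GFS mount,
while watching port 21064 with tcpdump, should allow the three cases
to be compared side by side.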

Does this tell you anything?  The timings are perhaps somewhat off
because of the 20 s runtime.  If you can make some sense out of this,
I'd be glad to hear it.  Also, I'd like to tweak the 5-minute
timeout; where does it come from?  Is it settable via sysfs or
gfs_tool?
-- 
Thanks,
Feri.