[Linux-cluster] Slowness above 500 RRDs

Tue Jun 12 14:46:52 UTC 2007

On Tue, Jun 12, 2007 at 04:01:04PM +0200, Ferenc Wagner wrote:
> Hi David,
> 
> Sorry if all that follows is misguided nonsense.  I'm eager to learn...
> 
> David Teigland <teigland at redhat.com> writes:
> 
> > The new code has much better caching in the dlm which will benefit flocks,
> > look at these flock numbers I sent before: [...]
> >
> > This is testing raw flock performance.  The dlm locks for normal file
> > operations should be cached and locally mastered also, so I'm not sure
> > what's causing the long times.  Make sure that drop_count is zero again,
> > now it's in sysfs:
> >   echo 0 > /sys/fs/gfs/<foo>:<bar>/lock_module/drop_count
> >
> > Also, mount debugfs so we can check some stuff later:
> >   mount -t debugfs none /sys/kernel/debug
> >
> > Then run some tests:
> > - mount on nodeA
> > - run the test on nodeA
> > - count locks on nodeA
> >   (cat /sys/kernel/debug/dlm/<bar> | grep Master | wc -l)
> > - mount on nodeB (don't do anything on this node)
> > - run the test again on nodeA
> > - count locks on nodeA and nodeB (see above)
> > - mount on nodeC (don't do anything on nodes B or C)
> > - run the test again on nodeA
> > - count locks on nodes A, B and C (see above)
> >
> > We're basically trying to produce the best-case performance from one node,
> > nodeA.  That means making sure that nodeA is mastering all locks and doing
> > maximum caching.  That's why it's important that we not do anything at all
> > that accesses the fs on nodes B or C, or do any extra mounts/unmounts.
> 
> I ran all the above tests and composed this reply a long time ago, but
> now, getting back to it, I decided to satisfy your curiosity.
> Behold...
> 
> > Plocks will be much slower and are probably not interesting to test, but
> > I'm curious if you added the "-l0" option to gfs_controld?  That option
> > turns off the code that intentionally limits the rate of plocks.  See the
> > old results again: [...]
> 
> Now, that switch makes ALL the difference.  With a single node
> switched on, I get results like this (with abbreviated strace -c
> output appended):
> 
> without -l0:
> 
> filecount=500
>   iteration=0 elapsed time=10.444446 s
>   iteration=1 elapsed time=9.693618 s
>   iteration=2 elapsed time=10.520073 s
>   iteration=3 elapsed time=10.521504 s
>   iteration=4 elapsed time=10.520183 s
> total elapsed time=51.699824 s
> Process 5265 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  83.27    0.048525           6      7551           read
>   6.73    0.003923           2      2502           fcntl64
>   4.47    0.002606           1      2528           close
>   3.09    0.001801           1      2551        23 open
>   0.74    0.000432           0      2507           write
>   0.71    0.000415           0      5033           mmap2
>   0.41    0.000237           0     12528         3 _llseek
>   0.31    0.000178           0      5001           munmap
>   0.18    0.000107           0      5015           fstat64
>   0.08    0.000049           0      2506           gettimeofday
>   0.00    0.000000           0        16        14 ioctl
>   0.00    0.000000           0       202       182 stat64
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.058273                 47974       229 total
> 
> with -l0:
> 
> filecount=500
>   iteration=0 elapsed time=5.966146 s
>   iteration=1 elapsed time=0.582058 s
>   iteration=2 elapsed time=0.528272 s
>   iteration=3 elapsed time=0.936438 s
>   iteration=4 elapsed time=0.528147 s
> total elapsed time=8.541061 s
> Process 10030 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  57.17    0.016527           2      7551           read
>  21.49    0.006213           2      2528           close
>   8.16    0.002358           1      2502           fcntl64
>   6.59    0.001904           1      2551        23 open
>   2.21    0.000638           0      2507           write
>   1.46    0.000421           0      5033           mmap2
>   0.86    0.000249         249         1           execve
>   0.73    0.000212           0      5001           munmap
>   0.65    0.000187           0     12528         3 _llseek
>   0.57    0.000165           0      5015           fstat64
>   0.12    0.000034           0      2506           gettimeofday
>   0.00    0.000000           0        16        14 ioctl
>   0.00    0.000000           0       202       182 stat64
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.028908                 47974       229 total
> 
> Looks like the bottleneck isn't the explicit locking (be it plock or
> flock), but something else, like the built-in GFS locking.

I'm guessing that these were run with a single node in the cluster?  The
second set of numbers (with -l0) wouldn't make much sense otherwise.  I
think if you add nodes to the cluster, the -l0 numbers will go up quite a
bit.  In the end I expect that flocks are still going to be the fastest
for you.
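
As a rough check on the single-node numbers quoted above, the ratio
between the two runs can be worked out directly.  This is only a sketch:
`speedup` is a made-up helper name, and the inputs are the elapsed times
copied from the quoted output.  Note that the first -l0 iteration took
about 6 s, so the ratio of the totals understates the steady-state
difference.

```shell
# speedup SLOW FAST: print SLOW/FAST to one decimal place.
# (hypothetical helper, not part of any GFS tool)
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# Total elapsed time over all five iterations, without -l0 vs. with -l0.
speedup 51.699824 8.541061

# Iteration 4 only, i.e. after the first (slow) pass with -l0.
speedup 10.520183 0.528147
```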

> A similarly dramatic speedup can be achieved (with a single node switched
> on, again) by the lockproto=lock_nolock mount option, even if used
> together with ignore_local_fs.  If I understand it right, this
> combination leaves the cluster-wide [pf]locks alone, just eliminates
> the GFS internal locking, which guards the internal consistency of the
> file system (please correct me if I'm wrong).

With nolock there is no cluster (lock_nolock just returns 0 for
everything), so the cluster-wide [pf]locks have zero cost.  So this test
doesn't tell you anything.

> What's strange is that gfs_controld -l0 seems like a perfectly safe
> invocation (what's the catch, i.e. why was the artificial limit
> introduced?),

The rate limit was introduced to prevent bad programs from flooding the
network with plock operations.  It may not be a very real problem, though,
so we might eventually disable it (-l0) by default.

> still it achieves almost the same speedup as using
> lock_nolock, which would be a disaster with more than one node
> mounting the fs.  (Also this trick scales pretty well to 4000 files.)

No, -l0 is not going to give you the performance of nolock.  I think you
must have been running with a single node in the cluster.  In that case
there are no other nodes to send/recv messages to/from, so the plock
messages are very fast.

> Again, the above tests were done with a single node switched on, and
> I'm not sure whether the results carry over to the real cluster setup;
> I will test it soon.

Ah, yep.  When you add nodes the plocks will become much slower.  Again, I
think you'll have better luck with flocks.

> I didn't touch drop_count either; everything was
> left as default, except for the mount options and the -l option.
> 
> Also, I can send the results of the scenario suggested by you, if it's
> still relevant.  In short: the locks are always mastered on node A
> only, but the performance is poor nevertheless.

Poor even in the first step when you're just mounting on nodeA?
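
To make the per-step counts from the procedure quoted earlier easier to
compare, the counting command can be wrapped in a small function.  This
is a minimal sketch: `count_master_locks` is a made-up name, and the
argument is whatever dlm debugfs file you would otherwise cat (the
<bar> part of the path is left for you to fill in).

```shell
# count_master_locks FILE: count the lines in FILE that contain "Master".
# Equivalent to: cat FILE | grep Master | wc -l
# (hypothetical helper; FILE would be /sys/kernel/debug/dlm/<bar>)
count_master_locks() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c Master "$1" || true
}
```

Run it on each node after every step and note the numbers next to the
elapsed times.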

Dave