[Linux-cluster] optimising DLM speed?

Tue Feb 15 18:20:20 UTC 2011

Hi,

On Tue, 2011-02-15 at 17:59 +0000, Alan Brown wrote:
> After lots of headbanging, I'm slowly realising that limits on GFS2 lock 
> rates and totem message passing appear to be the main inhibitors of 
> cluster performance.
> 
> Even on disks which are only mounted on one node (using lock_dlm), the 
> ping_pong rate is - quite frankly - appalling, at about 5000 
> locks/second, falling off to single digits when 3 nodes are active on 
> the same directory.
> 
Let me try to explain what is going on here. The posix (fcntl) locks
which you are using do not go through the dlm, or at least not the main
part of the dlm.

The lock requests are sent to either gfs_controld or dlm_controld,
depending upon the version of RHCS, and are processed there in
userspace via corosync/openais.
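
To see the cost of that posix (fcntl) lock path in isolation, something
like the following could be run against a file on the GFS2 mount and
again on a local filesystem. This is only a minimal sketch, not the
ping_pong tool itself: it assumes Python is available on the node, it
uses a single process with no contention, and lock_rate is a made-up
helper name for illustration.

```python
import fcntl
import os
import tempfile
import time


def lock_rate(path, seconds=1.0):
    """Return fcntl lock/unlock cycles per second on byte 0 of `path`."""
    cycles = 0
    with open(path, "r+b") as f:
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            # Take and release an exclusive posix lock on one byte.
            fcntl.lockf(f, fcntl.LOCK_EX, 1, 0, os.SEEK_SET)
            fcntl.lockf(f, fcntl.LOCK_UN, 1, 0, os.SEEK_SET)
            cycles += 1
    return cycles / seconds


if __name__ == "__main__":
    # Assumption: replace the temporary file with a path on the
    # filesystem you want to measure.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"\0")
        tmp.flush()
        print(f"{lock_rate(tmp.name):.0f} lock/unlock cycles per second")
```

The figure it prints is only comparable between filesystems on the same
hardware; it says nothing about behaviour when several nodes contend
for the same lock.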

> totem's defaults are pretty low:
> 
> (from man openais.conf)
> 
> max messages/second = 17
> window_size = 50
> encryption = on
> encryption/decryption threads = 1
> netmtu = 1500
> 
> I suspect tuning these would have a marked effect on performance
> 
> gfs_controld and dlm_controld aren't even appearing in the CPU usage 
> tables (24Gb dual 5560CPUs)
> 
Only one of gfs_controld/dlm_controld will have any part in dealing with
the locks that you are concerned with, depending on the version.

> We have 2 GFS clusters, 2 nodes (imap) and 3 nodes (fileserving)
> 
> The imap system has around 2.5-3 million small files in the Maildir imap 
> tree, whilst the fileserver cluster has ~90 1Tb filesystems of 1-4 
> million files apiece (fileserver total is around 150 million files)
> 
> When things get busy or when users get silly and drop 10,000 files in a 
> directory, performance across the entire cluster goes downhill badly - 
> not just in the affected disk or directory.
> 
> Even worse: backups - it takes 20-28 hours to run a 0 file incremental 
> backup of a 2.1million file system (ext4 takes about 8 minutes for the 
> same file set!)
> 
The issues you've reported here don't sound to me as if they are related
to the rate of posix locks which can be granted. These sound to me a lot
more like issues relating to the I/O pattern on the filesystem.

How is the data spread out across directories and across nodes? Do you
try to keep users local to a single node for the imap servers? Is the
backup just doing a single pass scan over the whole filesystem?
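
One rough way to answer the question about how the data is spread
across directories is a single walk that counts files per directory.
The sketch below is an illustration under assumptions, not a tool that
ships with the cluster suite: the function names are hypothetical, and
the walk touches every directory itself, so it is best run from one
node at a quiet time.

```python
import os
from collections import Counter


def files_per_directory(root):
    """Walk `root` once and count the non-directory entries in each directory."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        counts[dirpath] = len(filenames)
    return counts


def largest_directories(root, top=10):
    """Return the `top` directories holding the most files, largest first."""
    return files_per_directory(root).most_common(top)
```

Calling largest_directories() on a mount point returns a list of
(directory, file count) pairs, which should show quickly whether a few
very large directories dominate the tree.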

> 
> All heartbeat/lock traffic is handled across a dedicated Gb switch with 
> each cluster in its own vlan to ensure no external cruft gets in to 
> cause problems.
> 
> I'm seeing heartbeat/lock lan traffic peak out at about 120kb/s and 
> 4000pps per node at the moment. Clearly the switch isn't the problem - 
> and using hardware accelerated igb devices I'm pretty sure the 
> networking's fine too.
> 
During the actual workload, or just during the ping pong test?

Steve.

> SAN side, there are 4 8Gb Qlogic cards facing the fabric and right now 
> the whole mess talks to a Nexsan atabeast (which is slow, but seldom 
> gets its command queue maxed out.)
> 
> Has anyone played much with the totem message timings? If so, what 
> results have you had?
> 
> As a comparison, the same hardware using EXT4 on a standalone system can 
> trivially max out multiple 1Gb/s interfaces while transferring 1-2Mb/s 
> files and gives lock rates of 1.8-2.5 million locks/second even with 
> multiple ping_pong processes running.
> 