[Linux-cluster] dlm and IO speed problem <er, might wanna get a coffee first ; )>

Tue Apr 8 02:36:46 UTC 2008

Hi everyone,

I have a couple of questions about the tuning the dlm and gfs that
hopefully someone can help me with.

my setup:
6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
not new stuff, but corporate standards dictated the rev of rhat.

The cluster is a developer build cluster, where developers login, and
are balanced across nodes and edit and compile code. They can access via
vnc, XDMCP, ssh and telnet, and nodes external to the cluster can mount
the gfs home via nfs, balanced through the director. Their homes are on
the gfs, and accessible on all nodes.

I'm noticing huge differences in compile times - or any home file access
really - when doing stuff in the same home directory on the gfs on
different nodes. For instance, the same compile on one node is ~12
minutes - on another it's 18 minutes or more (not running concurrently).
I'm also seeing weird random pauses in writes, like saving a file in vi,
what would normally take less than a second, may take up to 10 seconds.

* From reading, I see that the first node to access a directory will be
the lock master for that directory. How long is that node the master? If
the user is no longer 'on' that node, is it still the master? If
continued accesses are remote, will the master state migrate to the node
that is primarily accessing it? I've set LVS persistence for ssh and
telnet for 5 minutes, to allow multiple xterms fired up in a script to
land on the same node, but new ones later will land on a different node
- by design really. Do I need to make this persistence way longer to
keep people only on the first node they hit? That kind of horks my load
balancing design if so. How can I see which node is master for which
directories? Is there a table I can read somehow?

* I've bumped the wake times for gfs_scand and gfs_inoded to 30 secs, I
mount noatime,noquota,nodiratime, and David Teigland recommended I set
dlm_dropcount to '0' today on irc, which I did, and I see an improvement
in speed on the node that appears to be master for say 'find' command
runs on the second and subsequent runs of the command if I restart them
immediately, but on the other nodes the speed is awful - worse than nfs
would be. On the first run of a find, or If I wait >10 seconds to start
another run after the last run completes, the time to run is
unbelievably slower than the same command on a standalone box with ext3.
e.g. <9 secs on the standalone, compared to 46 secs on the cluster - on
a different node it can take over 2 minutes! Yet an immediate re-run on
the cluster, on what I think must be the master is sub-second. How can I
speed up the first access time, and how can I keep the speed up similar
to immediate subsequent runs. I've got a ton of memory - I just do not
know which knobs to turn.

Am I expecting too much from gfs? Did I oversell it when I literally
fought to use it rather than nfs off the NetApp filer, insisting that
the performance of gfs smoked nfs? Or, more likely, do I just not
understand how to optimize it fully for my application?

Regards and Thanks,
-C