[Linux-cluster] dlm and IO speed problem <er, might wanna get a coffee first ; )>

gordan at bobich.net
Tue Apr 8 10:05:25 UTC 2008



> my setup:
> 6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
> not new stuff, but corporate standards dictated the rev of rhat.
[...]
> I'm noticing huge differences in compile times - or any home file access
> really - when doing stuff in the same home directory on the gfs on
> different nodes. For instance, the same compile on one node is ~12
> minutes - on another it's 18 minutes or more (not running concurrently).
> I'm also seeing weird random pauses in writes: saving a file in vi,
> which would normally take less than a second, may take up to 10 seconds.
>
> * From reading, I see that the first node to access a directory will be
> the lock master for that directory. How long is that node the master? If
> the user is no longer 'on' that node, is it still the master? If
> continued accesses are remote, will the master state migrate to the node
> that is primarily accessing it? I've set LVS persistence for ssh and
> telnet for 5 minutes, to allow multiple xterms fired up in a script to
> land on the same node, but new ones later will land on a different node
> - by design really. Do I need to make this persistence way longer to
> keep people only on the first node they hit? That kind of horks my load
> balancing design if so. How can I see which node is master for which
> directories? Is there a table I can read somehow?
>
> * I've bumped the wake times for gfs_scand and gfs_inoded to 30 secs, I
> mount with noatime,noquota,nodiratime, and David Teigland recommended I
> set dlm_dropcount to '0' today on irc, which I did. I see an improvement
> in speed on the node that appears to be master for, say, 'find' runs on
> the second and subsequent runs of the command if I restart them
> immediately, but on the other nodes the speed is awful - worse than nfs
> would be. On the first run of a find, or if I wait >10 seconds to start
> another run after the last one completes, the run time is unbelievably
> slow compared to the same command on a standalone box with ext3:
> e.g. <9 secs on the standalone versus 46 secs on the cluster - and on
> a different node it can take over 2 minutes! Yet an immediate re-run on
> the cluster, on what I think must be the master, is sub-second. How can
> I speed up the first access time, and how can I keep the speed close to
> that of the immediate subsequent runs? I've got a ton of memory - I just
> do not know which knobs to turn.

It sounds like bumping up lock trimming might help, but I don't think 
the ability to tune it through /sys has been back-ported to RHEL4. If 
you're stuck with RHEL4, you may have to rebuild the latest versions of 
the tools and kernel modules from RHEL5; otherwise you're out of luck.
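
If you do end up on a build that exposes the glock trimming tunables, 
they are gfs_tool settune-style knobs. A rough sketch only - the mount 
point and values are illustrative, and glock_purge/demote_secs may 
simply not exist in your current RHEL4 packages:

  # list the tunables your build actually exposes
  gfs_tool gettune /mnt/gfs

  # if glock_purge is present, have gfs_scand trim a percentage of
  # unused glocks on each pass (a percentage; 0 disables trimming)
  gfs_tool settune /mnt/gfs glock_purge 50

  # demote unused glocks sooner than the default 300 seconds
  gfs_tool settune /mnt/gfs demote_secs 60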

> Am I expecting too much from gfs? Did I oversell it when I literally
> fought to use it rather than nfs off the NetApp filer, insisting that
> the performance of gfs smoked nfs? Or, more likely, do I just not
> understand how to optimize it fully for my application?

Probably a combination of all of the above. The main advantage of GFS 
isn't speed; it's that GFS is a proper POSIX file system, unlike NFS or 
CIFS (e.g. file locking actually works on GFS). It also tends to stay 
consistent if a node fails, thanks to journalling.
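
For example, an advisory lock taken on one node is honoured on all the 
others. A minimal sketch with the flock(1) utility - the paths and 
script name are made up, and this assumes the file system isn't mounted 
with localflocks:

  # node A: hold an exclusive lock while updating shared data
  touch /gfs/home/shared/app.lock
  flock /gfs/home/shared/app.lock -c './update-shared-data.sh'

  # node B: the same command blocks until node A drops the lock -
  # the sort of cross-node guarantee you don't reliably get over NFS
  flock /gfs/home/shared/app.lock -c './update-shared-data.sh'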

Having said that, I've not seen speed differences as big as what you're 
describing, but I'm using RHEL5. I also have bandwidth charts for my 
DRBD/cluster interface, and the bandwidth usage on a lightly loaded system 
is not really significant unless lots of writes start happening. With 
mostly reads (which can all be served from the local DRBD mirror), the 
background "noise" traffic of combined DRBD and RHCS stays under 200Kb/s 
(25KB/s). Since the ping times are < 0.1ms, in theory, this should make 
locks take < 1ms to resolve/migrate. Of course, if your find goes over 
50,000 files, a 50 second delay to migrate all the locks may well be 
in a reasonable ball-park. You may find that things have moved on quite a 
bit since RHEL4...
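
To put rough numbers on that (a back-of-envelope sketch only - the path 
is made up, and it assumes roughly one glock per inode and ~1ms per 
remote lock operation):

  # count how many inodes a cold find actually walks
  find /gfs/home/user | wc -l        # say this prints 50000

  # 50,000 locks x ~1 ms per remote lock acquisition ~= 50 seconds,
  # which is in the same ball-park as the cold-run times you're seeing.

That would also explain why an immediate re-run on the node that already 
holds the locks comes back sub-second.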

Gordan



