[Linux-cluster] dlm and IO speed problem <er, might wanna get a coffee first ; )>

Kadlecsik Jozsef kadlec at sunserv.kfki.hu
Thu Apr 10 13:00:40 UTC 2008


On Wed, 9 Apr 2008, Wendy Cheng wrote:

> > What led me to suspect clashing in the hash (or some other lock-creating
> > issue) was the simple test I made on our five node cluster: on one node I
> > ran
> >
> > find /gfs -type f -exec cat {} > /dev/null \;
> >
> > and on another one just started an editor, naming a non-existent file.
> > It took multiple seconds while the editor "opened" the file. What else than
> > creating the lock could delay the process so long?
> >   
> 
> Not knowing how "find" is implemented, I would guess this is caused by
> directory locks. Creating a file needs a directory lock. Your exclusive write
> lock (file create) can't be granted until the "find" releases the directory
> lock. It doesn't look like a lock query performance issue to me.

As /gfs is a large directory structure with hundreds of user home 
directories, I don't think I just happened to pick the very directory 
that "find" was processing at that moment.

But this is a good clue to what might bite us most! Our GFS cluster is an 
almost mail-only cluster for users with Maildir. When the users experience 
temporary hangups for several seconds (even when writing a new mail), it 
might be due to the MUA scanning for new mail on one node concurrently 
with the MTA delivering to the same Maildir on another node.

What is really strange (and disturbing) is that such "hangups" can take 
10-20 seconds, which is just too much for the users.

In order to look at the possible tuning options and the side effects, I 
list what I have learned so far:

- Increasing glock_purge (percent, default 0) lets gfs_scand itself trim 
  back the unused glocks. Otherwise glocks accumulate and gfs_scand spends 
  more and more time scanning an ever-growing glock table.
- gfs_scand wakes up every scand_secs (default 5s) to scan the glocks, 
  looking for work to do. Increasing scand_secs lessens the load produced 
  by gfs_scand, but it hurts because flushing of data can be delayed.
- Decreasing demote_secs (seconds, default 300) flushes cached data more 
  often by moving write locks into less restricted states. Frequent 
  flushing helps to avoid burstiness *and* shortens the time another node 
  has to wait for the lock. The question is, what are the side effects of 
  small demote_secs values? (There is probably not much point in choosing 
  a demote_secs value smaller than scand_secs.)

Currently we are running with 'glock_purge = 20' and 'demote_secs = 30'.
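
For reference, these can be set per mount point with gfs_tool (assuming 
/gfs is the mount point; the values have to be re-applied after a 
remount), e.g.:

  gfs_tool settune /gfs glock_purge 20
  gfs_tool settune /gfs demote_secs 30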

> > But 'flushing when releasing glock' looks as a side effect. I mean, isn't
> > there a more direct way to control the flushing?
> 
> To make long story short, I did submit a direct cache flush patch first,
> instead of this final version of lock trimming patch. Unfortunately, it was
> *rejected*.

I see. Another question, just out of curiosity: why don't you use a kernel 
timer for every glock instead of gfs_scand? The hash bucket id of the 
glock would have to be added to struct gfs_glock, but the timer function 
could be almost identical to scan_glock. As far as I can see, the only 
drawback is that it would be equivalent to 'glock_purge = 100', and it 
would be tricky to emulate glock_purge != 100 settings. A rough sketch of 
what I mean is below.
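
Something along these lines (an untested sketch only; the helper and field 
names are made up, and the real demote/purge logic is elided):

#include <linux/timer.h>
#include <linux/jiffies.h>

/* stand-in for the real per-sb tunable (gfs_tune) */
static unsigned int demote_secs = 300;

struct gfs_glock {
	/* ... existing members ... */
	unsigned int gl_hash;		/* hash bucket id, so the callback can find its bucket */
	struct timer_list gl_timer;	/* per-glock demote timer */
};

/* Timer callback: roughly what scan_glock() does, but for one glock only. */
static void gfs_glock_timer_fn(unsigned long data)
{
	struct gfs_glock *gl = (struct gfs_glock *)data;

	/* ... try to demote/purge gl here, as scan_glock() would ... */

	/* re-arm as long as the glock stays cached */
	mod_timer(&gl->gl_timer, jiffies + demote_secs * HZ);
}

/* At glock creation something like:
 *	setup_timer(&gl->gl_timer, gfs_glock_timer_fn, (unsigned long)gl);
 *	mod_timer(&gl->gl_timer, jiffies + demote_secs * HZ);
 * and del_timer_sync(&gl->gl_timer) before the glock is freed.
 */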

Best regards,
Jozsef
--
E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary



