[Linux-cluster] dlm and IO speed problem <er, might wanna get a coffee first ; )>

Wendy Cheng s.wendy.cheng at gmail.com
Sat Apr 12 04:16:52 UTC 2008


Kadlecsik Jozsef wrote:
> On Thu, 10 Apr 2008, Kadlecsik Jozsef wrote:
>
>   
>> But this is a good clue to what might bite us most! Our GFS cluster is an 
>> almost mail-only cluster for users with Maildir. When the users experience 
>> temporary hangups for several seconds (even when writing a new mail), it 
>> might be due to the concurrent scanning for new mail on one node by the 
>> MUA and the delivery to the Maildir on another node by the MTA.
>>     

I personally don't know much about mail servers. But if anyone can 
explain more about what these two processes do, say, how that 
"MTA" delivers its mail (by the "rename" system call?) and/or how mails are 
moved from which node to where, we may have a better chance of figuring 
this puzzle out.

Note that "rename" system call is normally very expensive. Minimum 4 
exclusive locks are required (two directory locks, one file lock for 
unlink, one file lock for link), plus resource group lock if block 
allocation is required. There are numerous chances for deadlocks if not 
handled carefully. The issue is further worsen by the way GFS1 does its 
lock ordering - it obtains multiple locks based on lock name order. Most 
of the locknames are taken from inode number so their sequence always 
quite random. As soon as lock contention occurs, lock requests will be 
serialized to avoid deadlocks. So this may be a cause for these spikes 
where "rename"(s) are struggling to get lock order straight. But I don't 
know for sure unless someone explains how email server does its things. 
BTW, GFS2 has relaxed this lock order issue so it should work better.
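
For what it's worth, if the MTA follows the usual Maildir convention
(write to tmp/, then rename into new/), delivery looks roughly like the
sketch below. This is only my assumption about the delivery path, not the
actual MTA code; the point is that the final rename() is exactly the call
that needs all the locks described above:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of a Maildir-style delivery (assumed; real MTA code differs). */
int deliver(const char *maildir, const char *unique,
            const char *msg, size_t len)
{
    char tmp_path[4096], new_path[4096];

    snprintf(tmp_path, sizeof(tmp_path), "%s/tmp/%s", maildir, unique);
    snprintf(new_path, sizeof(new_path), "%s/new/%s", maildir, unique);

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, msg, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    close(fd);

    /* The atomic move from tmp/ to new/: this rename() is what takes the
       two directory locks plus the inode locks mentioned above on GFS. */
    if (rename(tmp_path, new_path) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}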

I'm going on a trip (away from the internet), but I'm interested to know how 
this story ends... Maybe by the time I get back to my laptop, someone will 
have figured this out. But please do share the story :) ...

-- Wendy

>> What is really strange (and disturbing) is that such "hangups" can take 
>> 10-20 seconds, which is just too much for the users.
>>     
>
> Yesterday we started to monitor the number of locks/held locks on two of 
> the machines. The results from the first day can be found at 
> http://www.kfki.hu/~kadlec/gfs/.
>
> It looks as if Maildir is definitely the wrong choice for GFS and we should 
> consider converting to mailbox format: at least I cannot explain the 
> spikes any other way.
>  
>   
>> In order to look at the possible tuning options and the side effects, I 
>> list what I have learned so far:
>>
>> - Increasing glock_purge (percent, default 0) helps to trim back the 
>>   unused glocks by gfs_scand itself. Otherwise glocks can accumulate and 
>>   gfs_scand spends more and more time scanning the larger and 
>>   larger table of glocks.
>> - gfs_scand wakes up every scand_secs (default 5s) to scan the glocks,  
>>   looking for work to do. By increasing scand_secs one can lessen the load 
>>   produced by gfs_scand, but it will hurt because flushing data can be 
>>   delayed.
>> - Decreasing demote_secs (seconds, default 300) helps to flush cached data
>>   more often by moving write locks into less restricted states. Flushing 
>>   often helps to avoid burstiness *and* to avoid holding up other nodes' 
>>   lock access. The question is, what are the side effects of small
>>   demote_secs values? (Probably there is not much point in choosing a
>>   demote_secs value smaller than scand_secs.)
>>
>> Currently we are running with 'glock_purge = 20' and 'demote_secs = 30'.
>>     
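
For reference, these GFS1 tunables are normally set per mount point with 
"gfs_tool settune" and do not survive a remount, so they have to be 
reapplied after each mount. A sketch only ("/gfs" is an example mount 
point, and the exact syntax may vary between versions):

gfs_tool settune /gfs glock_purge 20
gfs_tool settune /gfs demote_secs 30
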
>
> Best regards,
> Jozsef
> --
> E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
> PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
> Address: KFKI Research Institute for Particle and Nuclear Physics
>          H-1525 Budapest 114, POB. 49, Hungary
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>   




